GENOMIC PLANT SEQUENCES AND USES THEREOF 
REFERENCES TO RELATED APPLICATIONS 
This application claims priority under 35 U.S.C. §120 of U.S. Application Serial
Nos. 09/620,392, filed July 19, 2000, and 09/702,134, filed October 31, 2000, the disclosures
of which applications are incorporated herein by reference in their entirety.

INCORPORATION OF SEQUENCE LISTING 
Two copies of the sequence listing (Copy 1 and Copy 2) and a computer readable 
form of the sequence listing, all on CD-ROMs, each containing the file named 
Pa2_00329.txt which is 420,819,499 bytes (measured in MS-DOS) and was created on 
March 23, 2001, are herein incorporated by reference. 

INCORPORATION OF TABLES 1, 3, 4, 5 and 6. 
Two copies of Tables 1, 3, 4, 5, and 6 on CD-ROMs, each containing 
47,041,202 bytes (measured in MS-DOS) and each having the file name Pa_00329.txt all 
created on March 16, 2001, are herein incorporated by reference.

FIELD OF THE INVENTION 
The present invention relates to the field of plant biochemistry and genetics. 
Specifically, the invention relates to regulatory elements comprising genomic nucleic 
acid sequences from rice plants, and nucleic acid molecules containing the same. More 
specifically, the invention discloses nucleic acid sequences from Oryza sativa (rice) 
containing regulatory elements, such as promoters. The invention also discloses methods
of modifying, producing, and using the regulatory elements.

BACKGROUND OF THE INVENTION 

Promoters 

The genetic enhancement of plants and seeds provides significant benefits to 
society. For example, plants and seeds may be enhanced to have desirable agricultural, 
biosynthetic, commercial, chemical, insecticidal, industrial, nutritional, or pharmaceutical 
properties. Despite the availability of many molecular tools, however, the genetic 
modification of plants and seeds is often constrained by an insufficient or poorly 
localized expression of the engineered transgene. 

Many intracellular processes may impact overall transgene expression, including 
transcription, translation, protein assembly and folding, methylation, phosphorylation, 
transport, and proteolysis. Intervention in one or more of these processes can increase the 
amount of transgene expression in genetically engineered plants and seeds. For example, 
raising the steady-state level of mRNA in the cytosol often yields an increased 
accumulation of transgene expression. Many factors may contribute to increasing the 
steady-state level of an mRNA in the cytosol, including the rate of transcription, promoter 
strength and other regulatory features of the promoter, efficiency of mRNA processing, 
and the overall stability of the mRNA. 

Among these factors, the promoter plays a central role. Along the promoter, the 
transcription machinery is assembled and transcription is initiated. This early step is 
often rate-limiting relative to subsequent stages of protein production. Transcription 
initiation at the promoter may be regulated in several ways. For example, a promoter
may be induced by the presence of a particular compound or external stimulus, express a
gene only in a specific tissue, express a gene during a specific stage of development, or
constitutively express a gene. Thus, transcription of a transgene may be regulated by
operably linking the coding sequence to promoters with different regulatory
characteristics. Accordingly, regulatory elements, such as promoters, play a pivotal role
in enhancing the agronomic, pharmaceutical or nutritional value of crops.

At least two types of information are useful in predicting promoter regions within 
a genomic DNA sequence. First, promoters may be identified on the basis of their 
sequence "content," such as transcription factor binding sites and various known 
promoter motifs. (Stormo, Genome Research 10: 394-397 (2000)). Such signals may be 
identified by computer programs that identify sites associated with promoters, such as 
TATA boxes, transcription factor (TF) binding sites, and CpG islands. 
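
As a rough illustration of the content-based approach, the following Python sketch scans a sequence for a TATA-box-like motif and computes a CpG observed/expected ratio. The motif pattern TATA[AT]A[AT], the function names, and the example sequence are illustrative assumptions and are not taken from the cited reference; actual promoter-prediction programs rely on considerably richer models.

```python
import re

# Assumed consensus for a TATA-box-like motif (illustrative only).
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")


def find_tata_like_sites(seq):
    """Return 0-based start positions of TATA-box-like motifs in seq."""
    return [m.start() for m in TATA_PATTERN.finditer(seq.upper())]


def cpg_observed_expected(seq):
    """Return the CpG observed/expected ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # ratio is undefined without both C and G; report 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


example = "GGCGCGTATAAATAGCGCG"  # hypothetical sequence
print(find_tata_like_sites(example))
print(round(cpg_observed_expected(example), 2))
```

A region combining a TATA-like hit with an elevated CpG ratio might be flagged for closer inspection; the thresholds for doing so are left to the user.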

Second, promoters may be identified on the basis of their "location," i.e., their
proximity to a known or suspected coding sequence. (Stormo, Genome Research 10: 
394-397 (2000)). Promoters are typically contained within a region of DNA extending 
approximately 150-1500 basepairs in the 5' direction from the start codon of a coding 
sequence. Thus, promoter regions may be identified by locating the start codon of a 
coding sequence, and moving beyond the start codon in the 5' direction to locate the 
promoter region. 
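
The location-based approach can be sketched in Python as follows. This is a minimal example that assumes the coding sequence lies on the forward strand and that the position of its start codon is already known; the function name, the default window of 1500 basepairs, and the toy sequence are hypothetical.

```python
def upstream_region(genome_seq, start_codon_pos, length=1500):
    """Return up to `length` bases lying immediately 5' of a start codon.

    genome_seq      -- forward-strand sequence as a string
    start_codon_pos -- 0-based index of the 'A' of the ATG start codon
    """
    if genome_seq[start_codon_pos:start_codon_pos + 3].upper() != "ATG":
        raise ValueError("start_codon_pos does not point at an ATG")
    # Clamp at the beginning of the sequence if fewer bases are available.
    region_start = max(0, start_codon_pos - length)
    return genome_seq[region_start:start_codon_pos]


# Toy sequence: 20 C's, a TATA-like motif, 10 G's, then a start codon.
toy = "C" * 20 + "TATAAA" + "G" * 10 + "ATGGCC"
pos = toy.index("ATG")
print(upstream_region(toy, pos, length=16))
```

For a coding sequence on the reverse strand, the same idea would be applied to the reverse complement of the genomic sequence.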
Rice 

Approximately half a billion tons of rice are produced each year worldwide. More
than 90% of this rice is for human consumption (Goff, Curr. Opin. Plant Biol. 2:86-89
(1999)). Rice, however, is not only a commercially important crop; it is also a model for
other cereal crops, such as sorghum, maize, barley and wheat.

Rice is a model crop for several reasons. First, the genes in rice are predicted to 
be generally arranged in the genome in an order that is similar to other cereal crops. In 
fact, comparisons of the physical and genetic maps of cereal genomes have suggested the 
existence of a colinearity of gene order among the various cereal genomes studied. (Goff, 
Curr. Opin. Plant Biol. 2:86-89 (1999)). 

Second, studies of a number of individual genes indicate that there is considerable 
homology within gene families found among various cereal crops. This conservation of 
gene and protein sequences suggests that functional studies of genes or proteins from one 
cereal crop can help elucidate the function of similar genes or proteins in other cereal 
crops. Likewise, non-coding regulatory elements in rice, such as promoters, are predicted 
to display similar functions compared to related regulatory elements found in other cereal 
crops. Accordingly, a strong constitutive or tissue-specific promoter from one cereal is
more likely to retain its function when introduced as a portion of a transgene into another
cereal crop species (Goff, Curr. Opin. Plant Biol. 2:86-89 (1999)).

Third, rice can be used as a model for other cereal genomes because its genome is 
smaller than those of other major cereals. The size of the rice genome is estimated at 420 
to 450 megabase pairs. Sorghum, maize, barley and wheat have larger genomes (1000, 
3000, 5000 and 16000 Mbp, respectively). Despite such differences in genome size, 
however, the number of genes in each of these crops is on the same order of magnitude. 
Thus, the smaller genome size of rice results in a higher gene density relative to the other 
cereals. Based on estimates of 30,000 genes in a cereal genome, rice will have on
average one gene approximately every 15 Kbp. In contrast, maize and wheat have one
gene approximately every 100 and 500 Kbp, respectively. This higher gene density
makes rice an attractive target for cereal gene discovery efforts, genomic sequence 
analysis, and identification of regulatory elements, such as promoters (Goff, Curr. Opin. 
Plant Biol. 2:86-89 (1999)). 
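
The gene-density figures above follow from dividing genome size by gene number. The short Python sketch below reproduces the arithmetic using the genome sizes quoted in the text, taking 450 Mbp for rice and assuming the same estimate of 30,000 genes for every crop; the names used are illustrative.

```python
# Genome sizes in megabase pairs, as quoted in the text (rice: upper estimate).
GENOME_SIZE_MBP = {
    "rice": 450,
    "sorghum": 1000,
    "maize": 3000,
    "barley": 5000,
    "wheat": 16000,
}
ASSUMED_GENE_COUNT = 30_000  # estimate used in the text for a cereal genome


def kbp_per_gene(genome_mbp, gene_count=ASSUMED_GENE_COUNT):
    """Return the average spacing between genes in kilobase pairs."""
    return genome_mbp * 1000 / gene_count


for crop, size in GENOME_SIZE_MBP.items():
    print(f"{crop}: about one gene every {kbp_per_gene(size):.0f} Kbp")
```

Under these assumptions rice yields 15 Kbp per gene and maize 100 Kbp, while wheat yields roughly 533 Kbp, in line with the approximate figure of 500 Kbp given above.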

For these reasons, rice is a model for other crops. Accordingly, discoveries in rice 
may be extended to other crops. Thus, the identification of new genes, regulatory 
elements (e.g., promoters), etc. that function in rice is useful not only in developing 
enhanced varieties of rice, but also in developing enhanced varieties of other crops. In 
particular, developments in rice are applicable to other cereal crops, such as sorghum, 
maize, barley and wheat. 

Clearly, there exists a need in the art for new regulatory elements, such as 
promoters, that are capable of expressing heterologous nucleic acid sequences in 
important crop species. 

SUMMARY OF THE INVENTION 
The present invention includes and provides a substantially purified nucleic acid 
molecule comprising a nucleic acid sequence wherein the nucleic acid sequence: i) 
hybridizes under stringent conditions with a sequence selected from the group consisting 
of SEQ ID NO:1 through 57,467, and the complements thereof; or ii) exhibits an 85% or
greater identity to a sequence selected from the group consisting of SEQ ID NO:1
through 57,467. 

The present invention includes and provides a transgenic plant containing a 
nucleic acid molecule that comprises in the 5' to 3' direction: a nucleic acid sequence 
that: i) hybridizes under stringent conditions with a sequence selected from the group
consisting of SEQ ID NO:1 through 57,467, and the complements thereof; or ii) exhibits
an 85% or greater identity to a sequence selected from the group consisting of SEQ ID
NO:1 through 57,467; operably linked to a structural nucleic acid sequence; wherein the
nucleic acid sequence is heterologous with respect to the structural nucleic acid sequence.

The present invention includes and provides a seed from a transgenic plant 
containing a nucleic acid molecule that comprises in the 5' to 3' direction: a nucleic acid 
sequence that: i) hybridizes under stringent conditions with a sequence selected from the 
group consisting of SEQ ID NO:1 through 57,467, and the complements thereof; or ii)
exhibits an 85% or greater identity to a sequence selected from the group consisting of
SEQ ID NO:1 through 57,467; operably linked to a structural nucleic acid sequence;
wherein the nucleic acid sequence is heterologous with respect to the structural nucleic 
acid sequence. 

The present invention includes and provides a fertile transgenic plant containing a 
nucleic acid molecule that comprises in the 5' to 3' direction: a nucleic acid sequence 
that: i) hybridizes under stringent conditions with a sequence selected from the group 
consisting of SEQ ID NO:1 through 57,467, and the complements thereof; or ii) exhibits
an 85% or greater identity to a sequence selected from the group consisting of SEQ ID
NO:1 through 57,467; operably linked to a structural nucleic acid sequence; wherein the
nucleic acid sequence is heterologous with respect to the structural nucleic acid sequence. 

The present invention includes and provides a method of transforming a host cell
comprising: a) providing a nucleic acid molecule that comprises in the 5' to 3' direction:
a nucleic acid sequence that: i) hybridizes under stringent conditions with a sequence
selected from the group consisting of SEQ ID NO:1 through 57,467, and the
complements thereof; or ii) exhibits an 85% or greater identity to a sequence selected
from the group consisting of SEQ ID NO:1 through 57,467; operably linked to a
structural nucleic acid sequence; and b) transforming said plant with the nucleic acid
molecule.



DEFINITIONS
The following definitions are provided as an aid to understanding the detailed 
description of the present invention. 

The phrases "coding sequence," "structural sequence," and "structural nucleic 
acid sequence" refer to a physical structure comprising an orderly arrangement of nucleic 
acids. The nucleic acids are arranged in a series of nucleic acid triplets that each form a
codon. Each codon encodes a specific amino acid. Thus the coding sequence,
structural sequence, and structural nucleic acid sequence encode a series of amino acids 
forming a protein, polypeptide, or peptide sequence. The coding sequence, structural 
sequence, and structural nucleic acid sequence may be contained, without limitation, 
within a larger nucleic acid molecule, vector, etc. In addition, the orderly arrangement of 
nucleic acids in these sequences may be depicted, without limitation, in the form of a 
sequence listing, figure, table, electronic medium, etc. 

The phrases "DNA sequence" and "nucleic acid sequence" refer to a physical 
structure comprising an orderly arrangement of nucleic acids. The DNA sequence or 
nucleic acid sequence may be contained within a larger nucleic acid molecule, vector, or 
the like. In addition, the orderly arrangement of nucleic acids in these sequences may be 
depicted in the form of a sequence listing, figure, table, electronic medium, or the like. 

The term "expression" refers to the transcription of a gene to produce the
corresponding mRNA and translation of this mRNA to produce the corresponding gene
product (i.e., a peptide, polypeptide, or protein) and activity of the protein to confer a 
function. 

The term "expression of antisense RNA" refers to the transcription of a DNA to 
produce a first RNA molecule capable of hybridizing to a second RNA molecule. 

The term "gene" refers to chromosomal or genomic DNA, plasmid DNA, cDNA, 
synthetic DNA, or other DNA that encodes a peptide, polypeptide, protein, or RNA 
molecule. 

"Homology" refers to the level of similarity between two or more nucleic acid or 
amino acid sequences in terms of percent of positional identity (i.e., sequence similarity 
or identity). Homology also refers to the concept of similar functional properties among 
different nucleic acids or proteins. 

The phrase "heterologous" refers to the relationship between two or more nucleic 
acid or protein sequences that are derived from different sources. For example, a 
promoter is heterologous with respect to a coding sequence if such a combination is not 
normally found in nature. In addition, a particular sequence may be "heterologous" with 
respect to a cell or organism into which it is inserted (i.e., does not naturally occur in that
particular cell or organism). 

The term "hybridization" refers generally to the ability of nucleic acid molecules 
to join via complementary base strand pairing. Such hybridization may occur when 
nucleic acid molecules are contacted under appropriate conditions (see also, "specific 
hybridization," below). 

The phrase "operably linked" refers to the functional spatial arrangement of two 
or more nucleic acid regions or nucleic acid sequences. For example, a promoter region
may be positioned relative to a nucleic acid sequence such that transcription of the 
nucleic acid sequence is directed by the promoter region. Thus, the promoter region is 
"operably linked" to the nucleic acid sequence. 

The terms "promoter," "promoter region," and "promoter sequence" refer to a
nucleic acid sequence, usually found upstream (5') to a coding sequence, that directs
transcription of a nucleic acid sequence into mRNA. The promoter or promoter region
typically provides a recognition site for RNA polymerase and the other factors necessary
for proper initiation of transcription. As contemplated herein, a promoter or promoter 
region includes variations of promoters derived by inserting or deleting regulatory 
regions, subjecting the promoter to random or site-directed mutagenesis, etc. The activity 
or strength of a promoter may be measured in terms of the amounts of RNA it produces, 
or the amount of protein accumulation in a cell or tissue, relative to a promoter whose 
transcriptional activity has been previously assessed. 

The term "recombinant vector" refers to any agent such as a plasmid, cosmid, 
virus, autonomously replicating sequence, phage, or linear or circular single-stranded or 
double-stranded DNA or RNA nucleotide sequence. The recombinant vector may be 
derived from any source; is capable of genomic integration or autonomous replication; 
and comprises a promoter nucleic acid sequence operably linked to one or more nucleic 
acid sequences. A recombinant vector is typically used to introduce such operably linked 
sequences into a suitable host. 

"Regulatory sequence" refers to a nucleotide sequence located upstream (5'),
within, or downstream (3') to a coding sequence. Transcription and expression of the
coding sequence are typically impacted by the presence or absence of the regulatory
sequence.

"Specifically hybridizes" refers to the ability of two nucleic acid molecules to 
form an anti-parallel, double-stranded nucleic acid structure. A nucleic acid molecule is 
said to be the "complement" of another nucleic acid molecule if they exhibit "complete 
complementarity," i.e., each nucleotide in one sequence is complementary to its base 
pairing partner nucleotide in another sequence. Two molecules are said to be "minimally 
complementary" if they can hybridize to one another with sufficient stability to permit 
them to remain annealed to one another under at least conventional "low-stringency" 
conditions. Similarly, the molecules are said to be "complementary" if they can 
hybridize to one another with sufficient stability to permit them to remain annealed to one 
another under conventional "high-stringency" conditions. Nucleic acid molecules that 
hybridize to other nucleic acid molecules, e.g., at least under low stringency conditions 
are said to be "hybridizable cognates" of the other nucleic acid molecules. Conventional 
low stringency and high stringency conditions are described herein and by Sambrook et
al., Molecular Cloning, A Laboratory Manual, 2nd Ed., Cold Spring Harbor Press, Cold
Spring Harbor, New York (1989) and by Haymes et al., Nucleic Acid Hybridization, A
Practical Approach, IRL Press, Washington, DC (1985). Departures from complete 
complementarity are permissible, as long as such departures do not completely preclude 
the capacity of the molecules to form a double-stranded structure. 
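
The notion of "complete complementarity" for anti-parallel strands can be illustrated with a short Python sketch. It assumes plain DNA sequences written 5' to 3' using only the letters A, C, G and T; the function names are hypothetical and ambiguity codes are not handled.

```python
# Translation table mapping each DNA base to its pairing partner.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_complete_complement(seq_a, seq_b):
    """True if every base of seq_a pairs with its partner in seq_b
    when the two strands are aligned anti-parallel."""
    return seq_a.upper() == reverse_complement(seq_b).upper()


print(reverse_complement("ATGC"))
print(is_complete_complement("ATGC", "GCAT"))
```

Molecules that are only "minimally complementary" or "complementary" in the sense defined above would fail this strict test yet may still hybridize under the corresponding stringency conditions.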

The term "substantially homologous" refers to two sequences which are at least 
90% identical in sequence, as measured by the BestFit program described herein (Version 
10; Genetics Computer Group, Inc., Madison, WI), using default parameters.

"Substantially purified" refers to a molecule separated from substantially all other
molecules normally associated with it in its native state. More preferably a substantially 
purified molecule is the predominant species present in a preparation. A substantially 
purified molecule may be greater than 60% free, preferably 75% free, more preferably 
90% free, and most preferably 95% free from the other molecules (exclusive of solvent) 
present in the natural mixture. The term "substantially purified" is not intended to 
encompass molecules present in their native state. 

The term "transformation" refers to the introduction of nucleic acid into a 
recipient host. The term "host" refers to bacteria cells, fungi, animals and animal cells, 
plants and plant cells, or any plant parts or tissues including protoplasts, calli, roots, 
tubers, seeds, stems, leaves, seedlings, embryos, and pollen. 

The term "transgenic" refers to an animal, plant, or other organism containing one 
or more heterologous nucleic acid sequences. 

DETAILED DESCRIPTION OF THE INVENTION 

The present invention includes nucleic acid molecules comprising promoter 
sequences useful for transcribing a heterologous structural nucleic acid sequence in 
plants, and methods of modifying, producing, and using the same. The invention also 
includes compositions, transformed host cells, transgenic plants, and seeds containing the 
promoters, and methods for preparing and using the same. 
Nucleic Acid Molecules 

The present invention includes a nucleic acid molecule having a nucleic acid 
sequence that hybridizes to SEQ ID NO:1 through SEQ ID NO:57,467, or any
complements thereof; or any fragments thereof. The present invention also provides a
nucleic acid molecule comprising a nucleic acid sequence selected from the group
consisting of SEQ ID NO:1 through SEQ ID NO:57,467, any complements thereof, and
any fragments thereof. 

Nucleic acid hybridization is a technique well known to those of skill in the art of 
DNA manipulation. The hybridization properties of a given pair of nucleic acids are an 
indication of their similarity or identity. 

Low stringency conditions may be used to select nucleic acid sequences with 
lower sequence identities to a target nucleic acid sequence. One may wish to employ 
conditions such as about 0.15 M to about 0.9 M sodium chloride, at temperatures ranging 
from about 20°C to about 55°C. 

High stringency conditions may be used to select for nucleic acid sequences with 
higher degrees of identity to the disclosed nucleic acid sequences (Sambrook et al.,
1989). 

High stringency conditions typically involve nucleic acid hybridization in about 
2X to about 10X SSC (diluted from a 20X SSC stock solution containing 3 M sodium 
chloride and 0.3 M sodium citrate, pH 7.0 in distilled water), about 2.5X to about 5X 
Denhardt's solution (diluted from a 50X stock solution containing 1% (w/v) bovine
serum albumin, 1% (w/v) ficoll, and 1% (w/v) polyvinylpyrrolidone in distilled water), 
about 10 mg/mL to about 100 mg/mL fish sperm DNA, and about 0.02% (w/v) to about 
0.1% (w/v) SDS, with an incubation at about 50°C to about 70°C for several hours to 
overnight. High stringency conditions are preferably provided by 6X SSC, 5X 
Denhardt's solution, 100 mg/mL fish sperm DNA, and 0.1% (w/v) SDS, with an 
incubation at 55°C for several hours.

Hybridization is generally followed by several wash steps. The wash
compositions generally comprise 0.5X to about 10X SSC, and 0.01% (w/v) to about 0.5% 
(w/v) SDS with a 15 minute incubation at about 20°C to about 70°C. Preferably, the 
nucleic acid segments remain hybridized after washing at least one time in 0.1X SSC at 
65°C. 
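
Because SSC strengths are expressed as multiples of a working concentration, it can help to convert them to molar salt concentrations. The Python sketch below does so from the 20X stock composition given above (3 M sodium chloride, 0.3 M sodium citrate); the function names and the 100 mL example volume are illustrative assumptions.

```python
STOCK_X = 20  # strength of the stock solution described above
STOCK_MOLARITY = {"sodium chloride": 3.0, "sodium citrate": 0.3}


def ssc_molarity(strength_x):
    """Return the molar concentration of each salt at a given SSC strength."""
    return {
        solute: molar * strength_x / STOCK_X
        for solute, molar in STOCK_MOLARITY.items()
    }


def stock_volume_ml(strength_x, final_volume_ml):
    """Return the volume of 20X stock needed for the requested final volume."""
    return final_volume_ml * strength_x / STOCK_X


for strength in (6, 1, 0.1):
    nacl = ssc_molarity(strength)["sodium chloride"]
    print(f"{strength}X SSC: {nacl:.3f} M sodium chloride")
print(f"{stock_volume_ml(6, 100):.1f} mL of 20X stock per 100 mL of 6X SSC")
```

On this basis 6X SSC corresponds to about 0.9 M sodium chloride and 1X SSC to about 0.15 M, which matches the sodium chloride range given for low stringency conditions, while the 0.1X SSC wash corresponds to about 0.015 M.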

A nucleic acid molecule preferably comprises a nucleic acid sequence that 
hybridizes, under low or high stringency conditions, with SEQ ID NO:1 through SEQ ID
NO:57,467, any complements thereof, or any fragments thereof. A nucleic acid molecule
most preferably comprises a nucleic acid sequence that hybridizes under high stringency
conditions with SEQ ID NO:1 through SEQ ID NO:57,467, any complements thereof, or
any fragments thereof. 

In an alternative embodiment, the nucleic acid molecule comprises a nucleic acid 
sequence that exhibits 85% or greater identity, and more preferably at least 86 or greater, 
87 or greater, 88 or greater, 89 or greater, 90 or greater, 91 or greater, 92 or greater, 93 or 
greater, 94 or greater, 95 or greater, 96 or greater, 97 or greater, 98 or greater, or 99% or 
greater identity to a nucleic acid molecule selected from the group consisting of SEQ ID 
NO:1 through SEQ ID NO:57,467 and complements thereof. The nucleic acid molecule
most preferably comprises a nucleic acid sequence selected from the group consisting of
SEQ ID NO:1 through SEQ ID NO:57,467 and complements thereof.

The percent of sequence identity is preferably determined using the "Best Fit" or 
"Gap" program of the Sequence Analysis Software Package™ (Version 10; Genetics 
Computer Group, Inc., Madison, WI). "Gap" utilizes the algorithm of Needleman and 
Wunsch (Needleman and Wunsch, Journal of Molecular Biology 48:443-453, 1970) to 
find the alignment of two sequences that maximizes the number of matches and
minimizes the number of gaps. "BestFit" performs an optimal alignment of the best 
segment of similarity between two sequences and inserts gaps to maximize the number of 
matches using the local homology algorithm of Smith and Waterman (Smith and 
Waterman, Advances in Applied Mathematics, 2:482-489, 1981, Smith et al., Nucleic
Acids Research 11:2205-2220, 1983). The percent identity is most preferably determined
using the "Best Fit" program. 
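
For readers unfamiliar with global alignment, the following Python sketch computes a Needleman-Wunsch alignment score by dynamic programming. It is not the "Gap" program itself: the linear gap penalty and the match, mismatch and gap values are assumptions chosen for simplicity, and the commercial package uses its own scoring parameters.

```python
def needleman_wunsch_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of seq_a and seq_b."""
    # previous_row[j] is the best score for aligning the prefix of seq_a
    # processed so far against seq_b[:j].
    previous_row = [j * gap for j in range(len(seq_b) + 1)]
    for i, char_a in enumerate(seq_a, start=1):
        current_row = [i * gap]  # seq_a[:i] aligned entirely against gaps
        for j, char_b in enumerate(seq_b, start=1):
            diagonal = previous_row[j - 1] + (match if char_a == char_b else mismatch)
            up = previous_row[j] + gap          # gap inserted in seq_b
            left = current_row[j - 1] + gap     # gap inserted in seq_a
            current_row.append(max(diagonal, up, left))
        previous_row = current_row
    return previous_row[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("GATTACA", "GATACA"))
```

A local alignment in the style of Smith and Waterman differs mainly in that cell scores are never allowed to fall below zero and the best score anywhere in the matrix is reported.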

As used herein "sequence identity" refers to the extent to which two optimally 
aligned polynucleotide or peptide sequences are invariant throughout a window of 
alignment of components, e.g., nucleotides or amino acids. An "identity fraction" for 
aligned segments of a test sequence and a reference sequence is the number of identical 
components which are shared by the two aligned sequences divided by the total number 
of components in reference sequence segment, i.e., the entire reference sequence or a 
smaller defined part of the reference sequence. "Percent identity" is the identity fraction 
times 100. 
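
The identity fraction defined above can be computed from a pair of already aligned sequences, as in the Python sketch below. The handling of gaps is an assumption made for illustration: alignment columns in which the reference carries a gap are not counted as reference components. The function name and gap character are hypothetical.

```python
def percent_identity(aligned_test, aligned_reference, gap_char="-"):
    """Return percent identity relative to the reference sequence segment.

    Both arguments must be rows of the same alignment (equal length),
    with gap_char marking alignment gaps.
    """
    if len(aligned_test) != len(aligned_reference):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    reference_components = 0
    for test_char, ref_char in zip(aligned_test, aligned_reference):
        if ref_char == gap_char:
            continue  # assumed: a gap in the reference is not a component
        reference_components += 1
        if test_char.upper() == ref_char.upper():
            identical += 1
    if reference_components == 0:
        raise ValueError("reference segment is empty")
    return 100.0 * identical / reference_components


# Six of the seven reference positions are matched by the test sequence.
print(round(percent_identity("GAT-ACA", "GATTACA"), 1))
```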

Useful methods for determining sequence identity are also disclosed in Guide to 
Huge Computers, Martin J. Bishop, ed., Academic Press, San Diego, 1994, and Carillo, 
H., and Lipton, D., Applied Math (1988) 48:1073. More particularly, preferred computer 
programs for determining sequence identity include the Basic Local Alignment Search
Tool (BLAST) programs which are publicly available from the National Center for
Biotechnology Information (NCBI) at the National Library of Medicine, National
Institutes of Health, Bethesda, Md. 20894; see BLAST Manual, Altschul et al., NCBI,
NLM, NIH; Altschul et al., J. Mol. Biol. 215:403-410 (1990); version 2.0 or higher of
BLAST programs allows the introduction of gaps (deletions and insertions) into
alignments; for peptide sequence BLASTX can be used to determine sequence identity;
and, for polynucleotide sequence BLASTN can be used to determine sequence identity.

For purposes of this invention "percent identity" may also be determined using 
BLASTX version 2.0 for translated nucleotide sequences and BLASTN version 2.0 for 
polynucleotide sequences. In a preferred embodiment of the present invention, the 
presently disclosed rice genomic promoter sequences comprise nucleic acid molecules or
fragments having a BLAST score of more than 200, preferably a BLAST score of more 
than 300, and even more preferably a BLAST score of more than 400 with their 
respective homologues. 

Nucleic acid molecules of the present invention include nucleic acid sequences 
that are between about 0.01 Kb and about 50 Kb, more preferably between about 0.1 Kb 
and about 25 Kb, even more preferably between about 1 Kb and about 10 Kb, and most 
preferably between about 3 Kb and about 10 Kb, about 3 Kb and about 7 Kb, about 4 Kb 
and about 6 Kb, about 2 Kb and about 4 Kb, about 2 Kb and about 5 Kb, about 1 Kb and 
about 5 Kb, about 1 Kb and about 3 Kb, or about 1 Kb and about 2 Kb. 
Promoters 

Any of the nucleic acid molecules described herein may comprise nucleic acid 
sequences comprising promoters. Promoters of the present invention can include 
between about 300 bp upstream and about 10 kb upstream of the trinucleotide ATG 
sequence at the start site of a protein coding region. Promoters of the present invention 
can preferably include between about 300 bp upstream and about 5 kb upstream of the 
trinucleotide ATG sequence at the start site of a protein coding region. Promoters of the 
present invention can more preferably include between about 300 bp upstream and about
2 kb upstream of the trinucleotide ATG sequence at the start site of a protein coding 
region. Promoters of the present invention can include between about 300 bp upstream 
and about 1 kb upstream of the trinucleotide ATG sequence at the start site of a protein 
coding region. While in many circumstances a 300 bp promoter may be sufficient for 
expression, additional sequences may act to further regulate expression, for example, in 
response to biochemical, developmental or environmental signals. 

It is also preferred that the promoters of the present invention contain a CAAT 
and a TATA cis element. Moreover, the promoters of the present invention can contain 
one or more cis elements in addition to a CAAT and a TATA box. 

By "regulatory element" it is intended a series of nucleotides that determines if, 
when, and at what level a particular gene is expressed. The regulatory DNA sequences 
specifically interact with regulatory or other proteins. Many regulatory elements act in 
cis ("cis elements") and are believed to affect DNA topology, producing local 
conformations that selectively allow or restrict access of RNA polymerase to the DNA 
template or that facilitate selective opening of the double helix at the site of 
transcriptional initiation. Cis elements occur within, but are not limited to, promoters and
promoter modulating sequences (inducible elements). Cis elements can be identified
using known cis elements as a target sequence or target motif in the BLAST programs of 
the present invention. 



Promoters of the present invention include homologues of cis elements known to
effect gene regulation that show homology with the promoter sequences of the present
invention. These cis elements include, but are not limited to, oxygen responsive cis
elements (Cowen et al., J. Biol. Chem. 268(36):26904-26910 (1993)), light regulatory
elements (Bruce and Quail, Plant Cell 2(11):1081-1089 (1990); Bruce et al., EMBO J.
70:3015-3024 (1991); Rocholl et al., Plant Sci. 97:189-198 (1994); Block et al., Proc.
Natl. Acad. Sci. USA 57:5387-5391 (1990); Giuliano et al., Proc. Natl. Acad. Sci. USA
55:7089-7093 (1988); Staiger et al., Proc. Natl. Acad. Sci. USA 55:6930-6934 (1989);
Izawa et al., Plant Cell 5:1277-1287 (1994); Menkens et al., Trends in Biochemistry
20:506-510 (1995); Foster et al., FASEB J. 5:192-200 (1994); Plesse et al., Mol. Gen.
Genet. 254:258-266 (1997); Green et al., EMBO J. 5:2543-2549 (1987); Kuhlemeier et al.,
Ann. Rev. Plant Physiol. 55:221-257 (1987); Villain et al., J. Biol. Chem. 277:32593-
32598 (1996); Lam et al., Plant Cell 2:857-866 (1990); Gilmartin et al., Plant Cell
2:369-378 (1990); Datta et al., Plant Cell 7:1069-1077 (1989); Gilmartin et al., Plant
Cell 2:369-378 (1990); Castresana et al., EMBO J. 7:1929-1936 (1988); Ueda et al.,
Plant Cell 7:217-227 (1989); Terzaghi et al., Annu. Rev. Plant Physiol. Plant Mol. Biol.
45:445-474 (1995); Green et al., EMBO J. 6:2543-2549 (1987); Villain et al., J. Biol.
Chem. 277:32593-32598 (1996); Tjaden et al., Plant Cell 6:107-118 (1994); Tjaden et
al., Plant Physiol. 705:1109-1117 (1995); Ngai et al., Plant J. 72:1021-1234 (1997);
Bruce et al., EMBO J. 70:3015-3024 (1991); Ngai et al., Plant J. 72:1021-1034 (1997)),
elements responsive to gibberellin, (Muller et al., J. Plant Physiol. 745:606-613 (1995);
Croissant et al., Plant Science 775:27-35 (1996); Lohmer et al., EMBO J. 70:617-624
(1991); Rogers et al., Plant Cell 4:1443-1451 (1992); Lanahan et al., Plant Cell 4:203-
211 (1992); Skriver et al., Proc. Natl. Acad. Sci. USA 55:7266-7270 (1991); Gilmartin et
al., Plant Cell 2:369-378 (1990); Huang et al., Plant Mol. Biol. 74:655-668 (1990);
Gubler et al., Plant Cell 7:1879-1891 (1995)), elements responsive to abscisic acid,
(Busk et al., Plant Cell 9:2261-2270 (1997); Guiltinan et al., Science 250:267-270
(1990); Shen et al., Plant Cell 7:295-307 (1995); Shen et al., Plant Cell 8:1107-1119
(1996); Seo et al., Plant Mol. Biol. 27:1119-1131 (1995); Marcotte et al., Plant Cell
1:969-976 (1989); Shen et al., Plant Cell 7:295-307 (1995); Iwasaki et al., Mol. Gen.
Genet. 247:391-398 (1995); Hattori et al., Genes Dev. 6:609-618 (1992); Thomas et al.,
Plant Cell 5:1401-1410 (1993)), elements similar to abscisic acid responsive elements,
(Ellerstrom et al., Plant Mol. Biol. 52:1019-1027 (1996)), auxin responsive elements (Liu
et al., Plant Cell 6:645-657 (1994); Liu et al., Plant Physiol. 115:397-407 (1997);
Kosugi et al., Plant J. 7:877-886 (1995); Kosugi et al., Plant Cell 9:1607-1619 (1997);
Ballas et al., J. Mol. Biol. 255:580-596 (1993)), a cis element responsive to methyl
jasmonate treatment (Beaudoin and Rothstein, Plant Mol. Biol. 55:835-846 (1997)), a cis
element responsive to abscisic acid and stress response (Straub et al., Plant Mol. Biol.
26:617-630 (1994)), ethylene responsive cis elements (Itzhaki et al., Proc. Natl. Acad.
Sci. USA 91:8925-8929 (1994); Montgomery et al., Proc. Natl. Acad. Sci. USA 90:5939-
5943 (1993); Sessa et al., Plant Mol. Biol. 28:145-153 (1995); Shinshi et al., Plant Mol.
Biol. 27:923-932 (1995)), salicylic acid cis responsive elements, (Strange et al., Plant J.
11:1315-1324 (1997); Qin et al., Plant Cell 6:863-874 (1994)), a cis element that
responds to water stress and abscisic acid (Lam et al., J. Biol. Chem. 266:17131-17135
(1991); Thomas et al., Plant Cell 5:1401-1410 (1993); Pla et al., Plant Mol. Biol. 21:259-
266 (1993)), a cis element essential for M phase-specific expression (Ito et al., Plant Cell
10:331-341 (1998)), sucrose responsive elements (Huang et al., Plant Mol. Biol. 14:655-
668 (1990); Hwang et al., Plant Mol. Biol. 36:331-341 (1998); Grierson et al., Plant J.
5:815-826 (1994)), heat shock response elements (Pelham et al., Trends Genet. 1:31-35
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(1985)), elements responsive to auxin and/or salicylic acid and also reported for light 
regulation (Lam et al, Proc. Natl. Acad. Sci. USA 86:7890-7897 (1989); Benfey et al, 
Science 250:959-966 ( 1 990)), elements responsive to ethylene and salicylic acid (Ohme- 
Takagi et al, Plant Mol. Biol. 15:941-946 (1990)), elements responsive to wounding and 
abiotic stress (Loake etal, Proc. Natl. Acad. Sci. USA 89:9230-9234 (1992); Mhiri et al, 
Plant Mol. Biol. 33:257-266 (1997)), antoxidant response elements (Rushmore et al, J. 
Biol. Chem. 266:11632-11639; Dalton etal, Nucleic Acids Res. 22:5016-5023 (1994)), 
Sph elements (Suzuki et al., Plant Cell 9:799-807 (1997)), elicitor responsive elements 
(Fukuda et al., Plant Mol. Biol. 34:81-87 (1997); Rushton et al., EMBO J. 15:5690-5700 
(1996)), metal responsive elements (Stuart et al., Nature 317:828-831 (1985); Westin et 
al., EMBO J. 7:3763-3770 (1988); Thiele et al., Nucleic Acids Res. 20:1183-1191 (1992); 
Faisst et al., Nucleic Acids Res. 20:3-26 (1992)), low temperature responsive elements 
(Baker et al., Plant Mol. Biol. 24:701-713 (1994); Jiang et al., Plant Mol. Biol. 30:679-
684 (1996); Nordin et al., Plant Mol. Biol. 21:641-653 (1993); Zhou et al., J. Biol. 
Chem. 267:23515-23519 (1992)), drought responsive elements (Yamaguchi et al., Plant 
Cell 6:251-264 (1994); Wang et al., Plant Mol. Biol. 28:605-617 (1995); Bray EA, 
Trends in Plant Science 2:48-54 (1997)), enhancer elements for glutenin (Colot et al., 
EMBO J. 6:3559-3564 (1987); Thomas et al., Plant Cell 2:1171-1180 (1990); Kreis et 
al., Philos. Trans. R. Soc. Lond. B314:355-365 (1986)), light-independent regulatory 
elements (Lagrange et al., Plant Cell 9:1469-1479 (1997); Villain et al., J. Biol. Chem. 
271:32593-32598 (1996)), OCS enhancer elements (Bouchez et al., EMBO J. 8:4197-
4204 (1989); Foley et al., Plant J. 3:669-679 (1993)), ACGT elements (Foster et al., 
FASEB J. 8:192-200 (1994); Izawa et al., Plant Cell 6:1277-1287 (1994); Izawa et al., J. 
Mol. Biol. 230:1131-1144 (1993)), negative cis elements in plastid related genes (Zhou 
et al., J. Biol. Chem. 267:23515-23519 (1992); Lagrange et al., Mol. Cell Biol. 13:2614-
2622 (1993); Lagrange et al., Plant Cell 9:1469-1479 (1997)), prolamin box elements 
(Forde et al., Nucleic Acids Res. 
13:7327-7339 (1985); Colot et al., EMBO J. 6:3559-3564 (1987); Thomas et al., Plant 
Cell 2:1171-1180 (1990); Thompson et al., Plant Mol. Biol. 15:755-764 (1990); Vicente 
et al., Proc. Natl. Acad. Sci. USA 94:7685-7690 (1997)), and elements in enhancers from the 
IgM heavy chain gene (Gillies et al., Cell 33:717-728 (1983); Whittier et al., Nucleic 
Acids Res. 15:2515-2535 (1987)). 
Promoter Activity 

The activity or strength of a promoter may be measured in terms of the amount of 
mRNA or protein accumulation it specifically produces, relative to the total amount of 
mRNA or protein. The promoter preferably expresses an operably linked nucleic acid 
sequence at a level greater than 0.01%; more preferably greater than 0.05, 0.1, 0.25, 0.5, 
0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, or 20% (w/w) of the 
total cellular RNA or protein. 

As used herein, an "expression pattern" is any pattern of differential gene 
expression. In a preferred embodiment, an expression pattern is selected from the group 
consisting of tissue, temporal, spatial, developmental, stress, environmental, 
physiological, pathological, cell cycle, and chemically responsive expression patterns. 

As used herein, an "enhanced expression pattern" is any expression pattern for 
which an operably linked nucleic acid sequence is expressed at a level greater than 
0.01%; more preferably greater than 0.05, 0.1, 0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 
11, 12, 13, 14, 15, 16, 17, 18, 19, or 20% (w/w) of the total cellular RNA or protein. 

Alternatively, the activity or strength of a promoter may be expressed relative to a 
well-characterized promoter (for which transcriptional activity was previously assessed). 
For example, a less-characterized promoter may be operably linked to a reporter sequence 
(e.g., GUS) and introduced into a specific cell type. A well-characterized promoter (e.g. 
the 35S promoter) is similarly prepared and introduced into the same cellular context. 
Transcriptional activity of the less-characterized promoter is determined by comparing its 
level of reporter expression with that of the well-characterized promoter. In one embodiment, 
the activity of the present promoter is as strong as the 35S promoter when compared in 
the same cellular context. The cellular context is preferably rice, sorghum, maize, barley, 
wheat, canola, or soybean; and more preferably is rice, sorghum, maize, barley, or 
wheat; and most preferably is rice. 
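
The comparison described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name, the background-subtraction step, and the reporter readings are all hypothetical and are not taken from this specification.

```python
from statistics import mean


def relative_promoter_activity(test_readings, reference_readings, background=0.0):
    """Return reporter signal of the test promoter as a fraction of the reference.

    Both sets of readings are assumed to come from the same cell type and the
    same reporter (e.g., GUS), so that only the promoter differs.
    """
    test_signal = mean(test_readings) - background
    reference_signal = mean(reference_readings) - background
    if reference_signal <= 0:
        raise ValueError("reference signal must exceed background")
    return test_signal / reference_signal


# Hypothetical GUS readings in arbitrary units (illustrative values only).
candidate_readings = [410.0, 390.0, 400.0]
reference_35s_readings = [500.0, 520.0, 480.0]

ratio = relative_promoter_activity(
    candidate_readings, reference_35s_readings, background=100.0
)
print(f"candidate promoter activity = {ratio:.2f} x 35S")
```

A ratio near 1.0 would correspond to a promoter that is about as strong as the reference promoter in that cellular context.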
Structural Nucleic Acid Sequences 

The promoter of the present invention may be operably linked to a structural 
nucleic acid sequence that is heterologous with respect to the promoter. The structural 
nucleic acid sequence may generally be any nucleic acid sequence for which an increased 
level of transcription is desired. The structural nucleic acid sequence preferably encodes 
a polypeptide that is suitable for incorporation into the diet of a human or an animal. 
Suitable structural nucleic acid sequences include those encoding a yield protein, a stress 
resistance protein, a developmental control protein, a tissue differentiation protein, a 
meristem protein, an environmentally responsive protein, a senescence protein, a 
hormone responsive protein, an abscission protein, a source protein, a sink protein, a 
flower control protein, a seed protein, an herbicide resistance protein, a disease resistance 
protein, a fatty acid biosynthetic enzyme, a tocopherol biosynthetic enzyme, an amino 
acid biosynthetic enzyme, and an insecticidal protein. 

Alternatively, the promoter and structural nucleic acid sequence may be designed 
to down-regulate a specific nucleic acid sequence. This is typically accomplished by 
linking the promoter to a structural nucleic acid sequence that is oriented in the antisense 
direction. One of ordinary skill in the art is familiar with such antisense technology. 
Briefly, as the antisense nucleic acid sequence is transcribed, it hybridizes to and 
sequesters a complementary nucleic acid sequence inside the cell. This duplex RNA 
molecule cannot be translated into a protein by the cell's translational machinery. Any 
nucleic acid sequence may be negatively regulated in this manner. 
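
As a minimal sketch of the antisense relationship described above, the transcript produced from a sequence cloned in the antisense orientation is the reverse complement of the sense strand. The function names and the 12-nucleotide example sequence below are hypothetical and serve only as an illustration.

```python
# Watson-Crick complement table for DNA (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence."""
    return dna.translate(COMPLEMENT)[::-1]


def antisense_rna(sense_dna):
    """Return the RNA transcribed when sense_dna is cloned in antisense orientation."""
    return reverse_complement(sense_dna).upper().replace("T", "U")


# Hypothetical sense-strand fragment (illustrative only).
sense = "ATGGCCAAGTAA"
target_mrna = sense.replace("T", "U")
antisense = antisense_rna(sense)

print("target mRNA  5'-" + target_mrna + "-3'")
print("antisense    5'-" + antisense + "-3'")
```

Read 3' to 5', the antisense transcript pairs base for base with the target mRNA, which is the duplex the paragraph above refers to.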
Modified Structural Nucleic Acid Sequences 

The promoter of the present invention may also be operably linked to a modified 
structural nucleic acid sequence that is heterologous with respect to the promoter. The 
structural nucleic acid sequence may be modified to provide various desirable features. 
For example, a structural nucleic acid sequence may be modified to increase the content 
of essential amino acids, enhance translation of the amino acid sequence, alter post- 
translational modifications (e.g., phosphorylation sites), transport a translated product to a 
compartment inside or outside of the cell, improve protein stability, insert or delete cell 
signaling motifs, etc. 

Codon Usage in Structural Nucleic Acid Sequences 

Due to the degeneracy of the genetic code, different nucleotide codons may be 
used to code for a particular amino acid. A host cell often displays a preferred pattern of 
codon usage. Structural nucleic acid sequences are preferably constructed to utilize the 
codon usage pattern of the particular host cell. This generally enhances the expression of 
the structural nucleic acid sequence in a transformed host cell. Any of the above 
described nucleic acid and amino acid sequences may be modified to reflect the preferred 
codon usage of a host cell or organism in which they are contained. Modification of a 
structural nucleic acid sequence for optimal codon usage in plants is described in U.S. 
Patent No. 5,689,052. 
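
The idea of recoding a sequence for a host's preferred codons can be sketched as follows. The synonymous codon sets are those of the standard genetic code (the same sets listed in Table A of this specification). The `PREFERRED` mapping is a hypothetical placeholder, not measured codon usage for any organism, and the method of U.S. Patent No. 5,689,052 is not reproduced here.

```python
# Synonymous codons per amino acid (one-letter code), standard genetic code.
CODONS = {
    "A": ["GCA", "GCC", "GCG", "GCT"], "C": ["TGC", "TGT"],
    "D": ["GAC", "GAT"], "E": ["GAA", "GAG"], "F": ["TTC", "TTT"],
    "G": ["GGA", "GGC", "GGG", "GGT"], "H": ["CAC", "CAT"],
    "I": ["ATA", "ATC", "ATT"], "K": ["AAA", "AAG"],
    "L": ["TTA", "TTG", "CTA", "CTC", "CTG", "CTT"], "M": ["ATG"],
    "N": ["AAC", "AAT"], "P": ["CCA", "CCC", "CCG", "CCT"],
    "Q": ["CAA", "CAG"], "R": ["AGA", "AGG", "CGA", "CGC", "CGG", "CGT"],
    "S": ["AGC", "AGT", "TCA", "TCC", "TCG", "TCT"],
    "T": ["ACA", "ACC", "ACG", "ACT"], "V": ["GTA", "GTC", "GTG", "GTT"],
    "W": ["TGG"], "Y": ["TAC", "TAT"],
}
CODON_TO_AA = {codon: aa for aa, codons in CODONS.items() for codon in codons}

# HYPOTHETICAL host preference (placeholder values for illustration only).
PREFERRED = {"A": "GCC", "K": "AAG", "L": "CTG"}


def translate(cds):
    """Translate a coding sequence; stop codons are shown as '*'."""
    return "".join(CODON_TO_AA.get(cds[i:i + 3], "*") for i in range(0, len(cds), 3))


def recode(cds, preferred):
    """Swap each codon for the host-preferred synonym, leaving the protein unchanged."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    out = []
    for i in range(0, len(cds), 3):
        codon = cds[i:i + 3]
        aa = CODON_TO_AA.get(codon)  # None for stop codons, which are kept as-is
        out.append(preferred.get(aa, codon) if aa else codon)
    return "".join(out)


original = "ATGGCAAAATTATAA"
recoded = recode(original, PREFERRED)
print(original, "->", recoded)
```

Because only synonymous codons are exchanged, `translate(original)` and `translate(recoded)` give the same amino acid sequence.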

Other Modifications of Structural Nucleic Acid Sequences 

Additional variations in the structural nucleic acid sequences described above may 
encode proteins having equivalent or superior characteristics when compared to the 
proteins from which they are engineered. Mutations may include deletions, insertions, 
truncations, substitutions, fusions, shuffling of motif sequences, and the like. 

Mutations to a structural nucleic acid sequence may be introduced in either a 
specific or random manner, both of which are well known to those of skill in the art of 
molecular biology. A myriad of site-directed mutagenesis techniques exist, typically 
using oligonucleotides to introduce mutations at specific locations in a structural nucleic 
acid sequence. Examples include single strand rescue (Kunkel et al., Proc. Natl. Acad. 
Sci. U.S.A. 82:488-492, 1985), unique site elimination (Deng and Nickloff, Anal. 
Biochem. 200:81, 1992), nick protection (Vandeyar et al., Gene 65:129-133, 1988), and 
PCR (Costa et al., Methods Mol. Biol. 57:31-44, 1996). Random or non-specific 
mutations may be generated by chemical agents (for a general review, see Singer and 
Kusmierek, Ann. Rev. Biochem. 52:655-693, 1982) such as nitrosoguanidine (Cerda-
Olmedo et al., J. Mol. Biol. 33:705-719, 1968; Guerola et al., Nature New Biol. 230:
122-125, 1971) and 2-aminopurine (Rogan and Bessman, J. Bacteriol. 103:622-633, 
1970); or by biological methods such as passage through mutator strains (Greener et al., 
Mol. Biotechnol. 7:189-195, 1997). 

The modifications may result in either conservative or non-conservative changes 
in the amino acid sequence. Conservative changes result from additions, deletions, 
substitutions, etc. in the structural nucleic acid sequence which do not alter the final 
amino acid sequence of the protein. In a preferred embodiment, the protein has between 
20 and 500 conservative changes, more preferably between 15 and 300 conservative 
changes, even more preferably between 10 and 150 conservative changes, and most 
preferably between 5 and 75 conservative changes. 

Non-conservative changes include additions, deletions, and substitutions which 
result in an altered amino acid sequence. In a preferred embodiment, the protein has 
between 10 and 250 non-conservative amino acid changes, more preferably between 5 
and 100 non-conservative amino acid changes, even more preferably between 2 and 50 
non-conservative amino acid changes, and most preferably between 1 and 30 non- 
conservative amino acid changes. 

Additional methods of making the alterations described above are described by 
Ausubel et al., Current Protocols in Molecular Biology, John Wiley and Sons, Inc., 1995; 
Bauer et al., Gene 37:73, 1985; Craik, BioTechniques 3:12-19, 1985; Frits Eckstein et 
al., Nucleic Acids Research 10:6487-6497, 1982; Sambrook et al., Molecular Cloning: 
A Laboratory Manual, Second Edition, Cold Spring Harbor Laboratory Press, Cold 
Spring Harbor, N.Y., 1989; Smith et al., In: Genetic Engineering: Principles and 
Methods, Setlow et al., Eds., Plenum Press, N.Y., 1-32, 1981; and Osuna et al., Critical 
Reviews In Microbiology 20:107-116, 1994. 

Modifications may be made to the protein sequences of the present invention and 
the nucleic acid segments which encode them that maintain the desired properties of the 
molecule. The following is a discussion based upon changing the amino acid sequence of 
a protein to create an equivalent, or possibly an improved, second-generation molecule. 
The amino acid changes may be achieved by changing the codons of the structural 
nucleic acid sequence, according to the codons given in Table A. 

Table A: Codon degeneracy of amino acids 

Amino acid      One letter   Three letter   Codons
Alanine         A            Ala            GCA GCC GCG GCT
Cysteine        C            Cys            TGC TGT
Aspartic acid   D            Asp            GAC GAT
Glutamic acid   E            Glu            GAA GAG
Phenylalanine   F            Phe            TTC TTT
Glycine         G            Gly            GGA GGC GGG GGT
Histidine       H            His            CAC CAT
Isoleucine      I            Ile            ATA ATC ATT
Lysine          K            Lys            AAA AAG
Leucine         L            Leu            TTA TTG CTA CTC CTG CTT
Methionine      M            Met            ATG
Asparagine      N            Asn            AAC AAT
Proline         P            Pro            CCA CCC CCG CCT
Glutamine       Q            Gln            CAA CAG
Arginine        R            Arg            AGA AGG CGA CGC CGG CGT
Serine          S            Ser            AGC AGT TCA TCC TCG TCT
Threonine       T            Thr            ACA ACC ACG ACT
Valine          V            Val            GTA GTC GTG GTT
Tryptophan      W            Trp            TGG
Tyrosine        Y            Tyr            TAC TAT

Certain amino acids may be substituted for 
other amino acids in a protein sequence without appreciable loss of the desired activity. 
It is thus contemplated that various changes may be made in the peptide sequences of the 
disclosed protein sequences, or their corresponding nucleic acid sequences without 
appreciable loss of the biological activity. 

In making such changes, the hydropathic index of amino acids may be considered. 
The importance of the hydropathic amino acid index in conferring interactive biological 
function on a protein is generally understood in the art (Kyte and Doolittle, J. Mol. Biol. 
157:105-132, 1982). It is accepted that the relative hydropathic character of the amino 
acid contributes to the secondary structure of the resultant protein, which in turn defines 
the interaction of the protein with other molecules, for example, enzymes, substrates, 
receptors, DNA, antibodies, antigens, and the like. 

Each amino acid has been assigned a hydropathic index on the basis of its 
hydrophobicity and charge characteristics. These are: isoleucine (+4.5); valine (+4.2); 
leucine (+3.8); phenylalanine (+2.8); cysteine/cystine (+2.5); methionine (+1.9); alanine 
(+1.8); glycine (-0.4); threonine (-0.7); serine (-0.8); tryptophan (-0.9); tyrosine (-1.3); 
proline (-1.6); histidine (-3.2); glutamate/glutamine/aspartate/asparagine (-3.5); lysine 
(-3.9); and arginine (-4.5). 

It is known in the art that certain amino acids may be substituted by other amino 
acids having a similar hydropathic index or score and still result in a protein with similar 
biological activity, i.e., still obtain a biologically functional protein. In making such 
changes, the substitution of amino acids whose hydropathic indices are within ±2 is 
preferred, those within ±1 are more preferred, and those within ±0.5 are most preferred. 
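
The preference tiers above can be expressed as a simple lookup. The sketch below uses the hydropathic index values listed in this section; the function name, the tier labels, and the choice to treat the stated limits as inclusive are assumptions made for illustration.

```python
# Hydropathic index values as listed in this section (Kyte and Doolittle).
HYDROPATHY = {
    "I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "C": 2.5, "M": 1.9, "A": 1.8,
    "G": -0.4, "T": -0.7, "S": -0.8, "W": -0.9, "Y": -1.3, "P": -1.6,
    "H": -3.2, "E": -3.5, "Q": -3.5, "D": -3.5, "N": -3.5, "K": -3.9, "R": -4.5,
}


def substitution_tier(original, replacement, index=HYDROPATHY):
    """Classify a substitution by the difference between the two index values."""
    # Round to one decimal so that floating-point noise does not move a
    # difference across a tier boundary (limits are treated as inclusive).
    diff = round(abs(index[original] - index[replacement]), 1)
    if diff <= 0.5:
        return "most preferred"
    if diff <= 1.0:
        return "more preferred"
    if diff <= 2.0:
        return "preferred"
    return "outside the stated ranges"


for pair in [("I", "V"), ("L", "I"), ("G", "P"), ("A", "S")]:
    print(pair[0], "->", pair[1], ":", substitution_tier(*pair))
```

The same function could be applied to another scale by passing a different table as the `index` argument.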

It is also understood in the art that the substitution of like amino acids may be 
made effectively on the basis of hydrophilicity. U.S. Patent No. 4,554,101 (Hopp) states 
that the greatest local average hydrophilicity of a protein, as governed by the 
hydrophilicity of its adjacent amino acids, correlates with a biological property of the 
protein. The following hydrophilicity values have been assigned to amino acids: 
arginine/lysine (+3.0); aspartate/glutamate (+3.0 ±1); serine (+0.3); asparagine/glutamine 
(+0.2); glycine (0); threonine (-0.4); proline (-0.5 ±1); alanine/histidine (-0.5); cysteine 
(-1.0); methionine (-1.3); valine (-1.5); leucine/isoleucine (-1.8); tyrosine (-2.3); 
phenylalanine (-2.5); and tryptophan (-3.4). 

It is understood that an amino acid may be substituted by another amino acid 
having a similar hydrophilicity score and still result in a protein with similar biological 
activity, i.e., still obtain a biologically functional protein. In making such changes, the 
substitution of amino acids whose hydrophilicity values are within ±2 is preferred, those 
within ±1 are more preferred, and those within ±0.5 are most preferred. 
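
The "greatest local average hydrophilicity" mentioned above can be illustrated with a sliding-window average over the listed values. This is a sketch under stated assumptions: the window length of six residues, the function name, and the example peptide are hypothetical, and the ±1 ranges given for some residues are ignored in favor of the central values.

```python
# Central hydrophilicity values as listed in this section.
HYDROPHILICITY = {
    "R": 3.0, "K": 3.0, "D": 3.0, "E": 3.0, "S": 0.3, "N": 0.2, "Q": 0.2,
    "G": 0.0, "T": -0.4, "P": -0.5, "A": -0.5, "H": -0.5, "C": -1.0,
    "M": -1.3, "V": -1.5, "L": -1.8, "I": -1.8, "Y": -2.3, "F": -2.5, "W": -3.4,
}


def max_local_hydrophilicity(protein, window=6):
    """Return (start index, average) of the most hydrophilic window.

    The window length is an assumed parameter; it is not specified here.
    """
    if not 1 <= window <= len(protein):
        raise ValueError("window must be between 1 and the protein length")
    scores = [HYDROPHILICITY[aa] for aa in protein.upper()]
    best_start, best_avg = 0, float("-inf")
    for start in range(len(scores) - window + 1):
        avg = sum(scores[start:start + window]) / window
        if avg > best_avg:  # strict comparison keeps the first maximum
            best_start, best_avg = start, avg
    return best_start, best_avg


# Hypothetical peptide with a charged stretch in the middle.
peptide = "LLVVKRDEKRAAGG"
start, average = max_local_hydrophilicity(peptide)
print(peptide[start:start + 6], round(average, 2))
```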

As outlined above, amino acid substitutions are therefore based on the relative 
similarity of the amino acid side-chain substituents, for example, their hydrophobicity, 
hydrophilicity, charge, size, and the like. Exemplary substitutions which take various of 
the foregoing characteristics into consideration are well known to those of skill in the art 
and include: arginine and lysine; glutamate and aspartate; serine and threonine; glutamine 
and asparagine; and valine, leucine, and isoleucine. Changes which are not expected to 
be advantageous may also be used if these result in proteins having improved rumen 
resistance, increased resistance to proteolytic degradation, or both improved rumen 
resistance and increased resistance to proteolytic degradation, relative to the unmodified 
polypeptide from which they are engineered. 
Recombinant Vectors 

Any of the promoters and structural nucleic acid sequences described above may 
be provided in a recombinant vector. A recombinant vector typically comprises, in a 5' 
to 3' orientation: a promoter to direct the transcription of a structural nucleic acid 
sequence and a structural nucleic acid sequence. The recombinant vector may further 
comprise a 3' transcriptional terminator, a 3' polyadenylation signal, other untranslated 
nucleic acid sequences, transit and targeting nucleic acid sequences, selectable markers, 
enhancers, and operators, as desired. 

Means for preparing recombinant vectors are well known in the art. Methods for 
making recombinant vectors particularly suited to plant transformation include, without 
limitation, those described in U.S. Patent Nos. 4,971,908, 4,940,835, 4,769,061 and 
4,757,011. These types of vectors have also been reviewed (Rodriguez et al., Vectors: A 
Survey of Molecular Cloning Vectors and Their Uses, Butterworths, Boston, 1988; Glick 
et al., Methods in Plant Molecular Biology and Biotechnology, CRC Press, Boca Raton, 
Fla., 1993). 

Typical vectors useful for expression of nucleic acids in higher plants are well 
known in the art and include vectors derived from the tumor-inducing (Ti) plasmid of 
Agrobacterium tumefaciens (Rogers et al., Meth. In Enzymol. 153:253-277, 1987). 
Other recombinant vectors useful for plant transformation, including the pCaMVCN 
transfer control vector, have also been described (Fromm et al., Proc. Natl. Acad. Sci. 
USA 82(17):5824-5828, 1985). 
Promoters in the Recombinant Vectors 

The promoter used in the recombinant vector preferably transcribes a 
heterologous structural nucleic acid sequence at a high level in a plant. More preferably, 
the promoter hybridizes to a nucleic acid sequence selected from the group consisting of 
SEQ ID NO:1 through SEQ ID NO:57,467, or any complements thereof; or any 
fragments thereof. Suitable hybridization conditions include those described above. A 
nucleic acid sequence of the promoter preferably hybridizes, under low or high stringency 
conditions, with SEQ ID NO:1 through SEQ ID NO:57,467, and any complements 
thereof. The promoter most preferably hybridizes under high stringency conditions to a 
nucleic acid sequence selected from the group consisting of SEQ ID NO:1 through SEQ 
ID NO:57,467, and any complements thereof. 

In an alternative embodiment, the promoter comprises a nucleic acid sequence 
that exhibits 85% or greater identity, and more preferably 86, 87, 88, 89, 90, 91, 92, 93, 
94, 95, 96, 97, 98, or 99% or greater identity, to a nucleic acid sequence selected from 
the group consisting of SEQ ID NO:1 through SEQ ID NO:57,467, and complements 
thereof. The promoter most preferably comprises a nucleic acid sequence selected from 
the group consisting of SEQ ID NO:1 through SEQ ID NO:57,467, any complements 
thereof, and any fragments thereof. 
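
As a rough illustration of an identity threshold such as the 85% figure above, percent identity can be computed over two sequences that have already been aligned. The convention used here (matches divided by alignment columns, with gap columns counted as mismatches) is only one of several in use and is an assumption; the function names and example sequences are hypothetical, and the alignment step itself is not shown.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns of a pre-computed pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Drop columns where both sequences carry a gap character.
    columns = [
        (a, b)
        for a, b in zip(aligned_a.upper(), aligned_b.upper())
        if not (a == gap and b == gap)
    ]
    if not columns:
        raise ValueError("alignment has no usable columns")
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


def meets_identity_threshold(aligned_a, aligned_b, threshold=85.0):
    """True if the aligned pair reaches the given percent identity."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Hypothetical 20-nucleotide sequences differing at two positions.
query = "ACGTACGTACGTACGTACGT"
subject = "ACGTACGTACCTACGTACGA"
print(percent_identity(query, subject), meets_identity_threshold(query, subject))
```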

Additional Promoters in the Recombinant Vector 

One or more additional promoters may also be provided in the recombinant 
vector. These promoters may be operably linked to any of the structural nucleic acid 
sequences described above. Alternatively, the promoters may be operably linked to other 
nucleic acid sequences, such as those encoding transit peptides, selectable marker 
proteins, or antisense sequences. 

These additional promoters may be selected on the basis of the cell type into 
which the vector will be inserted. Promoters which function in bacteria, yeast, and plants 
are all well taught in the art. The additional promoters may also be selected on the basis 
of their regulatory features. Examples of such features include enhancement of 
transcriptional activity, inducibility, tissue-specificity, and developmental stage-
specificity. In plants, promoters that are inducible, of viral or synthetic origin, 
constitutively active, temporally regulated, and spatially regulated have been described 
(Poszkowski et al., EMBO J. 3:2719, 1989; Odell et al., Nature 313:810, 1985; Chau 
et al., Science 244:174-181, 1989). 

Often-used constitutive promoters include the CaMV 35S promoter (Odell et al., 
Nature 313:810, 1985), the enhanced CaMV 35S promoter, the Figwort Mosaic Virus 
(FMV) promoter (Richins et al., Nucleic Acids Res. 20:8451, 1987), the mannopine 
synthase (mas) promoter, the nopaline synthase (nos) promoter, and the octopine synthase 
(ocs) promoter. 

Useful inducible promoters include promoters induced by salicylic acid or 
polyacrylic acids (PR-1; Williams et al., Biotechnology 10:540-543, 1992), promoters 
induced by application of safeners (substituted benzenesulfonamide herbicides; Hershey 
and Stoner, Plant Mol. Biol. 17:679-690, 1991), heat-shock promoters (Ou-Lee et al., Proc. Natl. 
Acad. Sci. U.S.A. 83:6815, 1986; Ainley et al., Plant Mol. Biol. 14:949, 1990), a nitrate-
inducible promoter derived from the spinach nitrite reductase structural nucleic acid 
sequence (Back et al., Plant Mol. Biol. 17:9, 1991), hormone-inducible promoters 
(Yamaguchi-Shinozaki et al., Plant Mol. Biol. 15:905, 1990), and light-inducible 
promoters associated with the small subunit of RuBP carboxylase and LHCP families 
(Kuhlemeier et al., Plant Cell 1:471, 1989; Feinbaum et al., Mol. Gen. Genet. 226:449-
456, 1991; Weisshaar et al., EMBO J. 10:1777-1786, 1991; Lam and Chua, J. Biol. 
Chem. 266:17131-17135, 1990; Castresana et al., EMBO J. 7:1929-1936, 1988; 
Schulze-Lefert et al., EMBO J. 8:651, 1989). 

Examples of useful tissue-specific, developmentally-regulated promoters include 
the β-conglycinin 7Sα promoter (Doyle et al., J. Biol. Chem. 261:9228-9238, 1986; 
Slighton and Beachy, Planta 172:356, 1987), and seed-specific promoters (Knutzon et 
al., Proc. Natl. Acad. Sci. U.S.A. 89:2624-2628, 1992; Bustos et al., EMBO J. 10:1469-
1479, 1991; Lam and Chua, Science 248:471, 1991). Plant functional promoters useful 
for preferential expression in seed plastid include those from plant storage proteins and 
from proteins involved in fatty acid biosynthesis in oilseeds. Examples of such 
promoters include the 5' regulatory regions from such structural nucleic acid sequences 
as napin (Kridl et al., Seed Sci. Res. 1:209, 1991), phaseolin, zein, soybean trypsin 
inhibitor, ACP, stearoyl-ACP desaturase, and oleosin. Seed-specific regulation is 
discussed in EP 0 255 378. 

Another exemplary tissue-specific promoter is the lectin promoter, which is 
specific for seed tissue. The lectin protein in soybean seeds is encoded by a single 
structural nucleic acid sequence (Le1) that is only expressed during seed maturation and 
accounts for about 2 to about 5% of total seed mRNA. The lectin structural nucleic acid 
sequence and seed-specific promoter have been fully characterized and used to direct 
seed-specific expression in transgenic tobacco plants (Vodkin et al., Cell 34:1023, 1983; 
Lindstrom et al., Developmental Genetics 11:160, 1990). 

Particularly preferred additional promoters in the recombinant vector include the 
nopaline synthase (nos), mannopine synthase (mas), and octopine synthase (ocs) 
promoters, which are carried on tumor-inducing plasmids of Agrobacterium tumefaciens; 
the cauliflower mosaic virus (CaMV) 19S and 35S promoters; the enhanced CaMV 35S 
promoter; the Figwort Mosaic Virus (FMV) 35S promoter; the light-inducible promoter 
from the small subunit of ribulose-1,5-bisphosphate carboxylase (ssRUBISCO); the EIF-
4A promoter from tobacco (Mandel et al., Plant Mol. Biol. 29:995-1004, 1995); corn 
sucrose synthetase 1 (Yang et al., Proc. Natl. Acad. Sci. USA 87:4144-48, 1990); corn 
alcohol dehydrogenase 1 (Vogel et al., J. Cell Biochem. (Suppl) 13D:312, 1989); corn 
light harvesting complex (Simpson, Science 233:34, 1986); corn heat shock protein 
(Odell et al., Nature 313:810, 1985); the chitinase promoter from Arabidopsis (Samac 
et al., Plant Cell 3:1063-1072, 1991); the LTP (Lipid Transfer Protein) promoters from 
broccoli (Pyee et al., Plant J. 7:49-59, 1995); petunia chalcone isomerase (Van Tunen 
et al., EMBO J. 7:1257, 1988); bean glycine rich protein 1 (Keller et al., EMBO J. 8:
1309-1314, 1989); potato patatin (Wenzler et al., Plant Mol. Biol. 12:41-50, 1989); the 
ubiquitin promoter from maize (Christensen et al., Plant Mol. Biol. 18:675-689, 1992); 
and the actin promoter from rice (McElroy et al., Plant Cell 2:163-171, 1990). 

The additional promoter is preferably seed selective, tissue selective, constitutive, 
or inducible. The promoter is most preferably the nopaline synthase (NOS), octopine 
synthase (OCS), mannopine synthase (MAS), cauliflower mosaic virus 19S and 35S 
(CaMV19S, CaMV35S), enhanced CaMV (eCaMV), ribulose 1,5-bisphosphate 
carboxylase (ssRUBISCO), figwort mosaic virus (FMV), CaMV derived AS4, tobacco 
RB7, wheat POX1, tobacco EIF-4, lectin protein (Le1), or rice RC2 promoter. 
Structural Nucleic Acid Sequences in the Recombinant Nucleic Acid Vector 

The promoter in the recombinant vector is preferably operably linked to a 
structural nucleic acid sequence. Exemplary structural nucleic acid sequences, and 
modified forms thereof, are described in detail above. The promoter of the present 
invention may be operably linked to a structural nucleic acid sequence that is 
heterologous with respect to the promoter. In one aspect, the structural nucleic acid 
sequence may generally be any nucleic acid sequence for which an increased level of 
transcription is desired. The structural nucleic acid sequence preferably encodes a 
polypeptide that is suitable for incorporation into the diet of a human or an animal. 
Suitable structural nucleic acid sequences include those encoding a yield protein, a stress 
resistance protein, a developmental control protein, a tissue differentiation protein, a 
meristem protein, an environmentally responsive protein, a senescence protein, a 
hormone responsive protein, an abscission protein, a source protein, a sink protein, a 
flower control protein, a seed protein, an herbicide resistance protein, a disease resistance 
protein, a fatty acid biosynthetic enzyme, a tocopherol biosynthetic enzyme, an amino 
acid biosynthetic enzyme, and an insecticidal protein. 

Alternatively, the promoter and structural nucleic acid sequence may be designed 
to down-regulate a specific nucleic acid sequence. This is typically accomplished by 
linking the promoter to a structural nucleic acid sequence that is oriented in the antisense 
direction. One of ordinary skill in the art is familiar with such antisense technology. 
Using such an approach, a cellular nucleic acid sequence is effectively down-regulated as 
the subsequent steps of translation are disrupted. Nucleic acid sequences may be 
negatively regulated in this manner. 

Recombinant Vectors having Additional Structural Nucleic Acid Sequences 

The recombinant vector may also contain one or more additional structural 
nucleic acid sequences. These additional structural nucleic acid sequences may generally 
be any sequences suitable for use in a recombinant vector. Such structural nucleic acid 
sequences include any of the structural nucleic acid sequences, and modified forms 
thereof, described above. The additional structural nucleic acid sequences may also be 
operably linked to any of the above-described promoters. The one or more structural 
nucleic acid sequences may each be operably linked to separate promoters. Alternatively, 
the structural nucleic acid sequences may be operably linked to a single promoter (i.e., a 
single operon). 

The additional structural nucleic acid sequences preferably encode a yield protein, 
a stress resistance protein, a developmental control protein, a tissue differentiation 
protein, a meristem protein, an environmentally responsive protein, a senescence protein, 
a hormone responsive protein, an abscission protein, a source protein, a sink protein, a 
flower control protein, a seed protein, an herbicide resistance protein, a disease resistance 
protein, a fatty acid biosynthetic enzyme, a tocopherol biosynthetic enzyme, an amino 
acid biosynthetic enzyme, and an insecticidal protein. 

Alternatively, the second structural nucleic acid sequence may be designed to 
down-regulate a specific nucleic acid sequence. This is typically accomplished by 
operably linking the second structural nucleic acid sequence, in an antisense orientation, with a 
promoter. One of ordinary skill in the art is familiar with such antisense technology. The 
process is also briefly described above. Any nucleic acid sequence may be negatively 
regulated in this manner. 
Selectable Markers 

The recombinant vector may further comprise a selectable marker. The nucleic 
acid sequence serving as the selectable marker functions to produce a phenotype in cells 
which facilitates their identification relative to cells not containing the marker. 

Examples of selectable markers include, but are not limited to, a neo gene 
(Potrykus et al., Ann. Rev. Plant Physiol. Plant Mol. Biol. 42:205, 1991), which codes 
for kanamycin resistance and can be selected for using kanamycin, G418, etc.; a bar gene 
which codes for bialaphos resistance; a mutant EPSP synthase gene (Hinchee et al., 
Bio/Technology 6:915-922, 1988) which encodes glyphosate resistance; a nitrilase gene 
which confers resistance to bromoxynil (Stalker et al., J. Biol. Chem. 263:6310-6314, 
1988); a mutant acetolactate synthase gene (ALS) which confers imidazolinone or 
sulphonylurea resistance (European Patent Application No. 0 154 204); green fluorescent 
protein (GFP); and a methotrexate resistant DHFR gene (Thillet et al., J. Biol. Chem. 
263:12500-12508, 1988). 

Other exemplary selectable markers include: a β-glucuronidase or uidA gene 
(GUS), which encodes an enzyme for which various chromogenic substrates are known 
(Jefferson, Plant Mol. Biol. Rep. 5:387-405, 1987; Jefferson, et al., EMBO J. 6:3901-3907, 
1987); an R-locus gene, which encodes a product that regulates the production of 
anthocyanin pigments (red color) in plant tissues (Dellaporta et al., Stadler Symposium 
11:263-282, 1988); a β-lactamase gene (Sutcliffe et al., Proc. Natl. Acad. Sci. (U.S.A.) 
75:3737-3741, 1978), which encodes an enzyme for which various chromogenic 
substrates are known (e.g., PADAC, a chromogenic cephalosporin); a luciferase gene 
(Ow, et al., Science 234:856-859, 1986); a xylE gene (Zukowsky, et al., Proc. Natl. 
Acad. Sci. (U.S.A.) 80:1101-1105, 1983) which encodes a catechol dioxygenase that can 
convert chromogenic catechols; an α-amylase gene (Ikatu et al., Bio/Technol. 8:241-242, 
1990); a tyrosinase gene (Katz et al., J. Gen. Microbiol. 129:2703-2714, 1983), which 
encodes an enzyme capable of oxidizing tyrosine to DOPA and dopaquinone (which in 
turn condenses to melanin); and an α-galactosidase, which will convert a chromogenic 
α-galactose substrate. 

Included within the term "selectable markers" are also genes which encode a 
secretable marker whose secretion can be detected as a means of identifying or selecting 
for transformed cells. Examples include markers that encode a secretable antigen that can 
be identified by antibody interaction, or even secretable enzymes which can be detected 
catalytically. Selectable secreted marker proteins fall into a number of classes, including 
small, diffusible proteins which are detectable (e.g., by ELISA), small active enzymes 
which are detectable in extracellular solution (e.g., α-amylase, β-lactamase, 
phosphinothricin transferase), or proteins which are inserted or trapped in the cell wall 
(such as proteins which include a leader sequence such as that found in the expression 
unit of extensin or tobacco PR-S). Other possible selectable marker genes will be 
apparent to those of skill in the art. 

The selectable marker is preferably GUS, green fluorescent protein (GFP), 
neomycin phosphotransferase II (nptII), luciferase (LUX), an antibiotic resistance coding 
sequence, or an herbicide (e.g., glyphosate) resistance coding sequence. The selectable 
marker is most preferably a kanamycin, hygromycin, or herbicide resistance marker. 
Other Elements in the Recombinant Vector 

Various cis-acting untranslated 5' and 3' regulatory sequences may be included in 
the recombinant nucleic acid vector. Any such regulatory sequences may be provided in 
a recombinant vector with other regulatory sequences. Such combinations can be 
designed or modified to produce desirable regulatory features. 

A 3' non-translated region typically provides a transcriptional termination signal, 
and a polyadenylation signal which functions in plants to cause the addition of adenylate 
nucleotides to the 3' end of the mRNA. These may be obtained from the 3' regions of the 
nopaline synthase (nos) coding sequence, the soybean 7Sα storage protein coding 
sequence, the albumin coding sequence, and the pea ssRUBISCO E9 coding sequence. 
Particularly preferred 3' nucleic acid sequences include nos 3', E9 3', ADR12 3', 7Sα 3', 
11S 3', and albumin 3'. 

Typically, nucleic acid sequences located a few hundred base pairs downstream of 
the polyadenylation site serve to terminate transcription. These regions are required for 
efficient polyadenylation of transcribed mRNA. 

Translational enhancers may also be incorporated as part of the recombinant 
vector. Thus the recombinant vector may preferably contain one or more 5' non- 
translated leader sequences which serve to enhance expression of the nucleic acid 
sequence. Such enhancer sequences may be desirable to increase or alter the translational 
efficiency of the resultant mRNA. Preferred 5' nucleic acid sequences include dSSU 5', 
PetHSP70 5', and GmHSP17.9 5'. 

The recombinant vector may further comprise a nucleic acid sequence encoding a 
transit peptide. This peptide may be useful for directing a protein to the extracellular 
space, a chloroplast, or to some other compartment inside or outside of the cell (see, e.g., 
European Patent Application Publication Number 0218571). 

The structural nucleic acid sequence in the recombinant vector may comprise 
introns. The introns may be heterologous with respect to the structural nucleic acid 
sequence. Preferred introns include the rice actin intron and the corn HSP70 intron. 
Fusion Proteins 

Any of the above described structural nucleic acid sequences, and modified forms 
thereof, may be linked with additional nucleic acid sequences to encode fusion proteins. 
The additional nucleic acid sequence preferably encodes at least 1 amino acid, peptide, or 
protein. Production of fusion proteins is routine in the art and many possible fusion 
combinations exist. 

For instance, the fusion protein may provide a "tagged" epitope to facilitate 
detection of the fusion protein, such as GST, GFP, FLAG, or polyHIS. Such fusions 
preferably encode between 1 and 50 amino acids, more preferably between 5 and 30 
additional amino acids, and even more preferably between 5 and 20 amino acids. 

Alternatively, the fusion may provide regulatory, enzymatic, cell signaling, or 
intercellular transport functions. For example, a sequence encoding a chloroplast transit 
peptide may be added to direct a fusion protein to the chloroplasts within a plant cell. 
Such fusion partners preferably encode between 1 and 1000 additional amino acids, more 
preferably between 5 and 500 additional amino acids, and even more preferably between 
10 and 250 amino acids. 
Probes and Primers 

Short nucleic acid sequences having the ability to specifically hybridize to 
complementary nucleic acid sequences may be produced and utilized in the present 
invention. These short nucleic acid molecules may be used as probes to identify the 
presence of a complementary nucleic acid sequence in a given sample. Thus, by 
constructing a nucleic acid probe which is complementary to a small portion of a 
particular nucleic acid sequence, the presence of that nucleic acid sequence may be 
detected and assessed. 

Use of these probes may greatly facilitate the identification of transgenic plants 
which contain the presently disclosed nucleic acid molecules. The probes may also be 
used to screen cDNA or genomic libraries for additional nucleic acid sequences related to 
or sharing homology with the presently disclosed promoters and structural nucleic acid 
sequences. 

Alternatively, the short nucleic acid sequences may be used as oligonucleotide 
primers to amplify or mutate a complementary nucleic acid sequence using PCR 
technology. These primers may also facilitate the amplification of related complementary 
nucleic acid sequences (e.g. related nucleic acid sequences from other species). 

The short nucleic acid sequences may be used as probes and specifically as PCR 
probes. A PCR probe is a nucleic acid molecule capable of initiating a polymerase 
activity while in a double-stranded structure with another nucleic acid. Various methods 
for determining the structure of PCR probes and PCR techniques exist in the art. 
Computer generated searches using programs such as Primer3 
(www.genome.wi.mit.edu/cgi-bin/primer/primer3.cgi), STSPipeline 
(www-genome.wi.mit.edu/cgi-bin/www.STS_Pipeline), or GeneUp (Pesole, et al., 
BioTechniques 25:112-123, 1998), for example, can 
be used to identify potential PCR primers. 

The primer or probe is generally complementary to a portion of a nucleic acid 
sequence that is to be identified, amplified, or mutated. The primer or probe should be of 
sufficient length to form a stable and sequence-specific duplex molecule with its 
complement. The primer or probe preferably is about 10 to about 200 nucleotides long, 
more preferably is about 10 to about 100 nucleotides long, even more preferably is about 
10 to about 50 nucleotides long, and most preferably is about 14 to about 30 nucleotides 
long. 

The primer or probe may be prepared by direct chemical synthesis, by PCR (see, 
for example, U.S. Patents 4,683,195 and 4,683,202), or by excising the specific 
nucleic acid fragment from a larger nucleic acid molecule. 
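
As an illustration only, the length preferences and the complementarity requirement stated above can be sketched in Python. The function names are hypothetical, the "about" ranges are treated as hard limits for simplicity, and exact complementarity is assumed (real hybridization tolerates some mismatch).

```python
# Minimal sketch: classify a primer or probe by the length ranges stated above
# and test whether it is exactly complementary to part of a template.
# All names here are illustrative assumptions, not part of any disclosed tool.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def length_tier(primer: str) -> str:
    """Map primer length onto the preference tiers given in the text."""
    n = len(primer)
    if 14 <= n <= 30:
        return "most preferred"
    if 10 <= n <= 50:
        return "even more preferred"
    if 10 <= n <= 100:
        return "more preferred"
    if 10 <= n <= 200:
        return "preferred"
    return "outside stated ranges"


def anneals_to(primer: str, template: str) -> bool:
    """True if the primer is exactly complementary to a portion of the
    given template strand or of its complementary strand."""
    p, t = primer.upper(), template.upper()
    return reverse_complement(p) in t or p in t
```

For example, `length_tier("A" * 20)` falls in the 14 to 30 nucleotide tier, and `anneals_to("GCAT", "TTATGCTT")` is true because the reverse complement of the primer occurs in the template.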
Sequence Analysis 

In the present invention, sequence similarity or identity is preferably determined 
using the "BestFit" or "Gap" programs of the Sequence Analysis Software Package™ 
(Version 10; Genetics Computer Group, Inc., Center, Madison, WI). "Gap" utilizes the 
algorithm of Needleman and Wunsch (Needleman and Wunsch, Journal of Molecular 
Biology 48:443-453, 1970) to find the alignment of two sequences that maximizes the 
number of matches and minimizes the number of gaps. "BestFit" performs an optimal 
alignment of the best segment of similarity between two sequences. Optimal alignments 
are found by inserting gaps to maximize the number of matches using the local homology 
algorithm of Smith and Waterman (Smith, et al., In: Genetic Engineering: Principles and 
Methods, Setlow et al., Eds., Plenum Press, N.Y., 1-32, 1981; Smith and Waterman, 
Advances in Applied Mathematics, 2:482-489, 1981). 
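
The global alignment approach of Needleman and Wunsch can be sketched as follows. This is a minimal illustration, not the "Gap" program itself: the scoring values (match +1, mismatch -1, linear gap penalty -1) and the function names are assumptions chosen for brevity, whereas the cited package uses its own scoring matrices and gap penalties. Percent identity is computed here over the full alignment length, which is one of several conventions.

```python
# Minimal Needleman-Wunsch global alignment with a linear gap penalty.
# Scoring parameters are illustrative assumptions only.

def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = (match if a[i - 1] == b[j - 1] else mismatch) if i > 0 and j > 0 else None
        if sub is not None and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical, non-gap columns as a percentage of alignment length."""
    if not aligned_a:
        return 0.0
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)
```

Aligning `"ACGT"` with `"ACT"` under these assumed scores introduces a single gap opposite the `G`, giving three identical columns out of four.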

The Sequence Analysis Software Package described above contains a number of 
other useful sequence analysis tools for identifying homologues of the presently disclosed 
nucleotide and amino acid sequences. For example, the "BLAST" program (Altschul, et 
al., Journal of Molecular Biology 215: 403-410, 1990) searches for sequences similar to a 
query sequence (either peptide or nucleic acid) in a specified database (e.g., sequence 
databases maintained at the National Center for Biotechnology Information (NCBI) in 
Bethesda, Maryland, USA); "FastA" (Lipman and Pearson, Science, 227:1435-1441, 
1985; see also Pearson and Lipman, Proceedings of the National Academy of Sciences 
USA 85: 2444-2448, 1988; Pearson, Methods in Enzymology, (R. Doolittle, ed.), 183: 63-98, 
Academic Press, San Diego, California, USA, 1990) performs a Pearson and Lipman 
search for similarity between a query sequence and a group of sequences of the same type 
(nucleic acid or protein); "TfastA" performs a Pearson and Lipman search for similarity 
between a protein query sequence and any group of nucleotide sequences (it translates the 
nucleotide sequences in all six reading frames before performing the comparison); 
"FastX" performs a Pearson and Lipman search for similarity between a nucleotide query 
sequence and a group of protein sequences, taking frameshifts into account; and "TfastX" 
performs a Pearson and Lipman search for similarity between a protein query sequence 
and any group of nucleotide sequences, taking frameshifts into account (it translates both 
strands of the nucleic acid sequence before performing the comparison). 
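
The six-frame translation step mentioned for "TfastA" can be illustrated with a short Python sketch. This is not the code of any of the named programs. It assumes the standard genetic code, writes stop codons as `*`, writes unrecognized codons as `X`, and uses hypothetical function names.

```python
# Minimal sketch of translating a nucleotide sequence in all six reading frames.
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA sequence (upper-cased)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the first base; trailing bases are dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> dict:
    """Return translations keyed '+1'..'+3' (given strand) and '-1'..'-3' (reverse strand)."""
    forward = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames
```

A protein query can then be compared against each of the six translated frames in turn, which is the idea behind searching nucleotide databases with a protein sequence.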
Transgenic Plants and Plant Cells 

The invention also includes and provides transformed plant cells which comprise 
a nucleic acid molecule of the present invention. 

Preferred nucleic acid sequences of the present invention, including, without 
limitation, recombinant vectors, structural nucleic acid sequences, promoters, and other 
regulatory elements, are described above. A promoter preferably comprises a nucleic 
acid sequence that hybridizes under stringent conditions with a nucleic acid sequence 
selected from the group consisting of SEQ ID NO:1 through SEQ ID NO:57,467, any 
complement thereof, and any fragment thereof; or exhibits 85% or greater identity, and 
more preferably 86% or greater, 87% or greater, 88% or greater, 89% or greater, 90% or 
greater, 91% or greater, 92% or greater, 93% or greater, 94% or greater, 95% or greater, 
96% or greater, 97% or greater, 98% or greater, or 99% or greater identity to a nucleic 
acid sequence selected from the group consisting of SEQ ID NO:1 through SEQ ID 
NO:57,467, any complement thereof, and any fragment thereof. A promoter most 
preferably comprises SEQ ID NO:1 through SEQ ID NO:57,467. 

Methods for preparing such recombinant vectors are well known in the art. For 
example, methods for making recombinant vectors particularly suited to plant 
transformation are described in U.S. Patent Nos. 4,971,908, 4,940,835, 4,769,061 and 
4,757,011. These vectors have also been reviewed (Rodriguez et al., Vectors: A Survey 
of Molecular Cloning Vectors and Their Uses, Butterworths, Boston, 1988; Glick, et al., 
Methods in Plant Molecular Biology and Biotechnology, CRC Press, Boca Raton, Fla., 
1993) and are described above. 

Typical vectors useful for expression of nucleic acids in plant cells are well 
known in the art and include vectors derived from the tumor-inducing (Ti) plasmid of 
Agrobacterium tumefaciens (Rogers, et al., Meth. In Enzymol., 153: 253-277, 1987). 
Other recombinant vectors useful for plant transformation have also been described 
(Fromm et al., Proc. Natl. Acad. Sci. USA, 82(17): 5824-5828, 1985). Elements of such 
recombinant vectors are discussed above. 

The transformed plant or cell may generally be any plant or cell that is compatible 
with the present invention. The plant or cell preferably is alfalfa, apple, banana, barley, 
bean, broccoli, cabbage, carrot, castorbean, celery, citrus, clover, coconut, coffee, corn, 
cotton, cucumber, garlic, grape, linseed, melon, oat, olive, onion, palm, parsnip, pea, 
peanut, pepper, potato, radish, rapeseed, rice, rye, sorghum, soybean, spinach, strawberry, 
sugarbeet, sugarcane, sunflower, tobacco, tomato, or wheat. The transformed plant or 
cell is more preferably rice, sorghum, maize, barley, wheat, canola, or soybean; 
even more preferably rice, sorghum, maize, barley, or wheat; and most preferably is 
rice. The rice plant or cell is preferably Oryza sativa L. (japonica type), and more 
preferably Oryza sativa L. (japonica type), cv. Nipponbare. 

Method for Preparing Transformed Cells 

The invention is also directed to a method of producing transformed cells which 
comprise, in a 5' to 3' orientation, a promoter operably linked to a heterologous structural 
nucleic acid sequence. Other sequences may also be introduced into the cell along with 
the promoter and structural nucleic acid sequence. These other sequences may include 3' 
transcriptional terminators, 3' polyadenylation signals, other untranslated sequences, 
transit or targeting sequences, selectable markers, enhancers, and operators. 

Preferred recombinant vectors, structural nucleic acid sequences, promoters, and 
other regulatory elements are described above. The promoter preferably has a nucleic 
acid sequence that hybridizes under stringent conditions with SEQ ID NO:1 through SEQ 
ID NO:57,467, or any complement thereof; or exhibits 85% or greater identity, and more 
preferably 86% or greater, 87% or greater, 88% or greater, 89% or greater, 90% or greater, 
91% or greater, 92% or greater, 93% or greater, 94% or greater, 95% or greater, 96% or 
greater, 97% or greater, 98% or greater, or 99% or greater identity to SEQ ID NO:1 
through SEQ ID NO:57,467. 

The method generally comprises the steps of selecting a suitable host cell, 
transforming the host cell with a recombinant vector, and obtaining the transformed host 
cell. 

There are many methods for introducing nucleic acids into plant cells. Suitable 
methods include bacterial infection (e.g., Agrobacterium), binary bacterial artificial 
chromosome vectors, and direct delivery of DNA (e.g., via PEG-mediated transformation, 
desiccation/inhibition-mediated DNA uptake, electroporation, agitation with silicon 
carbide fibers, and acceleration of DNA coated particles, etc.) (reviewed in Potrykus, et 
al., Ann. Rev. Plant Physiol. Plant Mol. Biol., 42: 205, 1991). 

Technology for introduction of DNA into cells is well known to those of skill in 
the art. These methods can generally be classified into four categories: (1) chemical 
methods (Graham and Van der Eb, Virology, 54(2): 536-539, 1973; Zatloukal, et al., Ann. 
N.Y. Acad. Sci., 660: 136-153, 1992); (2) physical methods such as microinjection 
(Capecchi, Cell, 22(2): 479-488, 1980), electroporation (Wong and Neumann, Biochem. 
Biophys. Res. Commun., 107(2): 584-587, 1982; Fromm et al., Proc. Natl. Acad. Sci. 
USA, 82(17): 5824-5828, 1985; U.S. Patent No. 5,384,253) and particle acceleration 
(Johnston and Tang, Methods Cell Biol., 43(A): 353-365, 1994; Fynan et al., Proc. Natl. 
Acad. Sci. USA, 90(24): 11478-11482, 1993); (3) viral vectors (Clapp, Clin. Perinatol., 
20(1): 155-168, 1993; Lu, et al., J. Exp. Med., 178(6): 2089-2096, 1993; Eglitis and 
Anderson, Biotechniques, 6(7): 608-614, 1988); and (4) receptor-mediated mechanisms 
(Curiel et al., Hum. Gen. Ther., 3(2): 147-154, 1992; Wagner, et al., Proc. Natl. Acad. Sci. 
USA, 89(13): 6099-6103, 1992). Alternatively, nucleic acids can be introduced 
into pollen by directly injecting a plant's reproductive organs (Zhou, et al., Methods in 
Enzymology, 101: 433, 1983; Hess, Intern. Rev. Cytol., 107: 367, 1987; Luo, et al., Plant 
Mol. Biol. Reporter, 6: 165, 1988; Pena, et al., Nature, 325: 274, 1987). The nucleic acids 
may also be injected into immature embryos (Neuhaus, et al., Theor. Appl. Genet., 75: 
30, 1987). 

The recombinant vector used to transform the host cell typically comprises, in a 5' 
to 3' orientation: a promoter to direct the transcription of a structural nucleic acid 
sequence, a structural nucleic acid sequence, a 3' transcriptional terminator, and a 3' 
polyadenylation signal. The recombinant vector may further comprise untranslated 
nucleic acid sequences, transit and targeting nucleic acid sequences, selectable markers, 
enhancers, or operators. 

Suitable recombinant vectors, structural nucleic acid sequences, promoters, and 
other regulatory elements include, without limitation, those described above. 

The regeneration, development, and cultivation of plants from transformed plant 
protoplasts or explants is well taught in the art (Weissbach and Weissbach (Eds.), Methods 
for Plant Molecular Biology, Academic Press, Inc., San Diego, CA, 1988; Horsch et 
al., Science, 227: 1229-1231, 1985). In this method, transformants are generally cultured 
in the presence of a selective medium which selects for the successfully transformed cells 
and induces the regeneration of plant shoots (Fraley et al., Proc. Natl. Acad. Sci. U.S.A., 
80: 4803, 1983). These shoots are typically obtained within two to four months. 

The shoots are then transferred to an appropriate root-inducing medium 
containing the selective agent and an antibiotic to prevent bacterial growth. Many of the 
shoots will develop roots. These are then transplanted to soil or other media to allow the 
continued development of roots. The method, as outlined, will generally vary depending 
on the particular plant strain employed. 

The regenerated transgenic plants are self-pollinated to provide homozygous 
transgenic plants. Alternatively, pollen obtained from the regenerated transgenic plants 
may be crossed with non-transgenic plants, preferably inbred lines of agronomically 
important species. Conversely, pollen from non-transgenic plants may be used to 
pollinate the regenerated transgenic plants. 

The transgenic plant may pass along the transformed nucleic acid sequence to its 
progeny. The transgenic plant is preferably homozygous for the transformed nucleic acid 
sequence and transmits that sequence to all of its offspring as a result of sexual 
reproduction. Progeny may be grown from seeds produced by the transgenic plant. 
These additional plants may then be self-pollinated to generate a true breeding line of 
plants. 

The progeny from these plants are evaluated, among other things, for gene 
expression. The gene expression may be detected by several common methods such as 
western blotting, northern blotting, immunoprecipitation, and ELISA. 

Methods for transforming dicots, primarily by use of Agrobacterium tumefaciens, 
and obtaining transgenic plants have been published for cotton (U.S. Patent No. 
5,004,863; U.S. Patent No. 5,159,135; U.S. Patent No. 5,518,908); soybean (U.S. Patent 
No. 5,569,834; U.S. Patent No. 5,416,011; McCabe, et al., Bio/Technology, 6: 923, 1988; 
Christou et al., Plant Physiol. 87:671-674 (1988)); Brassica (U.S. Patent No. 5,463,174); 
peanut (Cheng et al., Plant Cell Rep. 15:653-657 (1996), McKently et al., Plant Cell Rep. 
14:699-703 (1995)); papaya; and pea (Grant et al., Plant Cell Rep. 15:254-258 (1995)). 

Transformation of monocotyledons using electroporation, particle bombardment 
and Agrobacterium has also been reported. Transformation and plant regeneration have 
been achieved in asparagus (Bytebier et al., Proc. Natl. Acad. Sci. (USA) 84:5354 
(1987)); barley (Wan and Lemaux, Plant Physiol. 104:37 (1994)); maize (Rhodes et al., 
Science 240:204 (1988); Gordon-Kamm et al., Plant Cell 2:603-618 (1990); Fromm et 
al., Bio/Technology 8:833 (1990); Koziel et al., Bio/Technology 11:194 (1993); 
Armstrong et al., Crop Science 35:550-557 (1995)); oat (Somers et al., Bio/Technology 
10:1589 (1992)); orchard grass (Horn et al., Plant Cell Rep. 7:469 (1988)); rice 
(Toriyama et al., Theor. Appl. Genet. 205:34 (1986); Park et al., Plant Mol. Biol. 32:1135-1148 
(1996); Abedinia et al., Aust. J. Plant Physiol. 24:133-141 (1997); Zhang and Wu, 
Theor. Appl. Genet. 76:835 (1988); Zhang et al., Plant Cell Rep. 7:379 (1988); Battraw 
and Hall, Plant Sci. 86:191-202 (1992); Christou et al., Bio/Technology 9:957 (1991)); 
rye (De la Pena et al., Nature 325:274 (1987)); sugarcane (Bower and Birch, Plant J. 
2:409 (1992)); tall fescue (Wang et al., Bio/Technology 10:691 (1992)) and wheat (Vasil 
et al., Bio/Technology 10:667 (1992); U.S. Patent No. 5,631,152). 
Other Transformed Organisms 

Any of the above described promoters and structural nucleic acid sequences may 
be introduced into any cell or organism such as a mammalian cell, mammal, fish cell, 
fish, bird cell, bird, algae cell, algae, fungal cell, fungi, or bacterial cell. Preferred hosts 
and transformants include: fungal cells such as Aspergillus, yeasts, mammals 
(particularly bovine and porcine), insects, bacteria and algae. 

The transformed cell or organism is preferably prokaryotic, more preferably a 
bacterial cell, even more preferably an Agrobacterium, Bacillus, Escherichia, or 
Pseudomonas cell, and most preferably is an Escherichia coli cell. Alternatively, the 
transformed organism is preferably a yeast or fungal cell. The yeast cell is preferably a 
Saccharomyces cerevisiae, Schizosaccharomyces pombe, or Pichia pastoris. 

Methods to transform such cells or organisms are known in the art (EP 0238023; 
Yelton et al., Proc. Natl. Acad. Sci. (U.S.A.), 81:1470-1474 (1984); Malardier et al., 
Gene, 78:147-156 (1989); Becker and Guarente, In: Abelson and Simon (eds.), Guide to 
Yeast Genetics and Molecular Biology, Methods Enzymol., Vol. 194, pp. 182-187, 
Academic Press, Inc., New York; Ito et al., J. Bacteriology, 153:163 (1983); Hinnen et 
al., Proc. Natl. Acad. Sci. (U.S.A.), 75:1920 (1978); Bennett and LaSure (eds.), More 
Gene Manipulations in Fungi, Academic Press, CA (1991)). Methods to produce 
proteins of the present invention from such organisms are also known (Kudla et al., 
EMBO, 9:1355-1364 (1990); Jarai and Buxton, Current Genetics, 26:238-244 (1994); 
Verdier, Yeast, 6:271-297 (1990); MacKenzie et al., Journal of Gen. Microbiol., 
139:2295-2307 (1993); Hartl et al., TIBS, 19:20-25 (1994); Bergeron et al., TIBS, 19:124-128 
(1994); Demolder et al., J. Biotechnology, 32:179-189 (1994); Craig, Science, 
260:1902-1903 (1993); Gething and Sambrook, Nature, 355:33-45 (1992); Puig and 
Gilbert, J. Biol. Chem., 269:7764-7771 (1994); Wang and Tsou, FASEB Journal, 7:1515-1517 
(1993); Robinson et al., Bio/Technology, 12:381-384 (1994); Enderlin and 
Ogrydziak, Yeast, 10:67-79 (1994); Fuller et al., Proc. Natl. Acad. Sci. (U.S.A.), 86:1434-1438 
(1989); Julius et al., Cell, 37:1075-1089 (1984); Julius et al., Cell, 32:839-852 
(1983)). 

Exemplary Uses 

The presently disclosed promoter sequences may be used as genetic markers and 
employed in genetic mapping studies using linkage analysis. A genetic linkage map 
shows the relative locations of specific DNA markers along a chromosome. Maps are 
used for the identification of genes associated with genetic diseases or phenotypic traits, 
comparative genomics, and as a guide for physical mapping. Through genetic mapping, a 
fine scale linkage map can be developed using DNA markers, and, then, a genomic DNA 
library of large-sized fragments can be screened with molecular markers linked to the 
desired trait. In a preferred embodiment of the present invention, a genomic library is 
screened with the promoter sequences of the present invention. 

Mapping marker locations is based on the observation that two markers located 
near each other on the same chromosome will tend to be passed together from parent to 
offspring. During gamete production, DNA strands occasionally break and rejoin in 
different places on the same chromosome or on the homologous chromosome. The closer 
the markers are to each other, the more tightly linked and the less likely a recombination 
event will fall between and separate them. Recombination frequency thus provides an 
estimate of the distance between two markers. 
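
As a rough illustration of how a recombination frequency translates into a map distance, the sketch below applies the Haldane and Kosambi mapping functions. These mapping functions are standard in linkage analysis but are not named in the text above, so their use here is an assumption, and the function names are hypothetical.

```python
# Minimal sketch: recombination frequency and map distance in centimorgans (cM).
import math


def recombination_frequency(recombinants: int, total: int) -> float:
    """Fraction of scored progeny (or gametes) that are recombinant."""
    if total <= 0 or not 0 <= recombinants <= total:
        raise ValueError("counts must satisfy 0 <= recombinants <= total, total > 0")
    return recombinants / total


def haldane_cm(r: float) -> float:
    """Haldane map distance in cM (assumes no crossover interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r: float) -> float:
    """Kosambi map distance in cM (allows for some interference)."""
    if not 0 <= r < 0.5:
        raise ValueError("r must be in [0, 0.5)")
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))
```

For small recombination frequencies both functions give a distance close to 100 times the frequency, consistent with the statement that closely spaced markers are rarely separated; the two diverge as the markers become less tightly linked.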

In segregating populations, target genes have been reported to be placed 
within an interval of 5-10 cM with a high degree of certainty (Tanksley et al., Trends in 
Genetics 11(2):63-68 (1995)). The markers defining this interval are used to screen a 
larger segregating population to identify individuals derived from one or more gametes 
containing a crossover in the given interval. Such individuals are useful in orienting 
other markers closer to the target gene. Once identified, these individuals can be 
analyzed in relation to all molecular markers within the region to identify those closest to 
the target. 

Markers of the present invention can be employed to construct linkage maps and 
to locate genes with qualitative and quantitative effects. The genetic linkage of additional 
marker molecules can be established by a genetic mapping model such as, without 
limitation, the flanking marker model reported by Lander and Botstein, Genetics, 
121:185-199 (1989), and the interval mapping, based on maximum likelihood methods 
described by Lander and Botstein, Genetics, 121:185-199 (1989), and implemented in the 
software package MAPMAKER/QTL (Lincoln and Lander, Mapping Genes Controlling 
Quantitative Traits Using MAPMAKER/QTL, Whitehead Institute for Biomedical 
Research, Massachusetts, (1990)). Additional software includes Qgene, Version 2.23 
(1996) (Department of Plant Breeding and Biometry, 266 Emerson Hall, Cornell 
University, Ithaca, NY). Use of the Qgene software is a particularly preferred approach. 

A maximum likelihood estimate (MLE) for the presence of a marker is calculated, 
together with an MLE assuming no QTL effect, to avoid false positives. A log10 of an 
odds ratio (LOD) is then calculated as: LOD = log10 (MLE for the presence of a 
QTL / MLE given no linked QTL). 
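
The LOD calculation above can be written as a short Python sketch. The function names are hypothetical and are not taken from MAPMAKER/QTL or Qgene. The second function assumes the two likelihoods are supplied as natural-log values, which is a common practical convention rather than something stated in the text.

```python
# Minimal sketch of the LOD score: log10 of the ratio of two maximized likelihoods.
import math


def lod_score(likelihood_with_qtl: float, likelihood_no_qtl: float) -> float:
    """LOD = log10(likelihood assuming a QTL / likelihood assuming no linked QTL)."""
    if likelihood_with_qtl <= 0 or likelihood_no_qtl <= 0:
        raise ValueError("likelihoods must be positive")
    return math.log10(likelihood_with_qtl / likelihood_no_qtl)


def lod_from_log_likelihoods(lnl_with_qtl: float, lnl_no_qtl: float) -> float:
    """Same quantity computed from natural-log likelihoods, avoiding underflow."""
    return (lnl_with_qtl - lnl_no_qtl) / math.log(10.0)
```

A LOD of 3, for instance, corresponds to the data being 1000 times more likely under the model with a QTL than under the model without one.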

The LOD score essentially indicates how much more likely the data are to have 
arisen assuming the presence of a QTL than in its absence. The LOD threshold value for 
avoiding a false positive with a given confidence, say 95%, depends on the number of 
markers and the length of the genome. Graphs indicating LOD thresholds are set forth in 
Lander and Botstein, Genetics, 121:185-199 (1989), and further described by Arus and 
Moreno-Gonzalez, Plant Breeding, Hayward, Bosemark, Romagosa (eds.) Chapman & 
Hall, London, pp. 314-331 (1993). 

Additional models can be used. Many modifications and alternative approaches 
to interval mapping have been reported, including the use of non-parametric methods 
(Kruglyak and Lander, Genetics, 139:1421-1428 (1995)). Multiple regression methods 
or models can also be used, in which the trait is regressed on a large number of 
markers (Jansen, Biometrics in Plant Breeding, van Oijen, Jansen (eds.) Proceedings of the 
Ninth Meeting of the Eucarpia Section Biometrics in Plant Breeding, The Netherlands, 
pp. 116-124 (1994); Weber and Wricke, Advances in Plant Breeding, Blackwell, Berlin, 
16 (1994)). Procedures combining interval mapping with regression analysis, whereby the 
phenotype is regressed onto a single putative QTL at a given marker interval, and at the 
same time onto a number of markers that serve as 'cofactors,' have been reported by 
Jansen and Stam, Genetics, 136:1447-1455 (1994) and Zeng, Genetics, 136:1457-1468 
(1994). Generally, the use of cofactors reduces the bias and sampling error of the 
estimated QTL positions (Utz and Melchinger, Biometrics in Plant Breeding, van Oijen, 
Jansen (eds.) Proceedings of the Ninth Meeting of the Eucarpia Section Biometrics in 
Plant Breeding, The Netherlands, pp. 195-204 (1994)), thereby improving the precision 
and efficiency of QTL mapping (Zeng, Genetics, 136:1457-1468 (1994)). These models 
can be extended to multi-environment experiments to analyze genotype-environment 
interactions (Jansen et al., Theor. Appl. Genet. 91:33-37 (1995)). 

Selection of an appropriate mapping population is important to map construction. 
The choice of appropriate mapping population depends on the type of marker systems 
employed (Tanksley et al., J. P. Gustafson and R. Appels (eds.), Plenum Press, New York, 
pp. 157-173 (1988)). Consideration must be given to the source of parents (adapted vs. 
exotic) used in the mapping population. Chromosome pairing and recombination rates 
can be severely disturbed (suppressed) in wide crosses (adapted x exotic) and generally 
yield greatly reduced linkage distances. Wide crosses will usually provide segregating 
populations with a relatively large array of polymorphisms when compared to progeny in 
a narrow cross (adapted x adapted). 

An F2 population is the first generation of selfing after the hybrid seed is 
produced. Usually a single F1 plant is selfed to generate a population segregating for all 
the genes in Mendelian (1:2:1) fashion. Maximum genetic information is obtained from a 
completely classified F2 population using a codominant marker system (Mather, 
Measurement of Linkage in Heredity: Methuen and Co., (1938)). In the case of dominant 
markers, progeny tests (e.g., F3, BCF2) are required to identify the heterozygotes, thus 
making it equivalent to a completely classified F2 population. However, this procedure is 
often prohibitive because of the cost and time involved in progeny testing. Progeny 
testing of F2 individuals is often used in map construction where phenotypes do not 
consistently reflect genotype (e.g., disease resistance) or where trait expression is 
controlled by a QTL. Segregation data from progeny test populations (e.g., F3 or BCF2) 
can be used in map construction. Marker-assisted selection can then be applied to cross 
progeny based on marker-trait map associations (F2, F3), where linkage groups have not 
been completely disassociated by recombination events (i.e., maximum disequilibrium). 
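
By way of illustration only, the Mendelian (1:2:1) segregation expected for a codominant
marker in an F2 population can be checked with a chi-square goodness-of-fit test. The
following Python sketch is not part of the disclosed methods; the function names and the
genotype counts are hypothetical, and the critical value assumes 2 degrees of freedom at
a 0.05 significance level.

```python
# Illustrative sketch: chi-square test of F2 marker counts against a 1:2:1 ratio.
# Function names and thresholds are assumptions made for this example.

# Chi-square critical value for 2 degrees of freedom at alpha = 0.05.
CRITICAL_DF2_ALPHA_005 = 5.991


def chi_square_121(n_aa, n_ab, n_bb):
    """Return the chi-square statistic for observed AA, AB and BB counts."""
    total = n_aa + n_ab + n_bb
    if total == 0:
        raise ValueError("no genotyped plants")
    expected = (total / 4, total / 2, total / 4)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def fits_mendelian(n_aa, n_ab, n_bb):
    """True when the counts do not deviate significantly from 1:2:1."""
    return chi_square_121(n_aa, n_ab, n_bb) < CRITICAL_DF2_ALPHA_005
```

A marker showing significant deviation in such a test may reflect segregation distortion
or scoring error and may warrant inspection before it is placed on a map.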

Recombinant inbred lines (RIL) (genetically related lines; usually >F5, developed 
from continuously selfing F2 lines towards homozygosity) can be used as a mapping 
population. Information obtained from dominant markers can be maximized by using 
RIL because all loci are homozygous or nearly so. Under conditions of tight linkage (i.e.,
about <10% recombination), dominant and co-dominant markers evaluated in RIL
populations provide more information per individual than either marker type in backcross
populations (Reiter et al., Proc. Natl. Acad. Sci. USA 89:1477-1481 (1992)). However, as the
distance between markers becomes larger (i.e., loci become more independent), the 
information in RIL populations decreases dramatically when compared to codominant 
markers. 

Backcross populations (e.g., generated from a cross between a successful variety 
(recurrent parent) and another variety (donor parent) carrying a trait not present in the 
former) can be utilized as a mapping population. A series of backcrosses to the recurrent 
parent can be made to recover most of its desirable traits. Thus a population is created 
consisting of individuals nearly like the recurrent parent but each individual carries 
varying amounts or mosaic of genomic regions from the donor parent. Backcross 
populations can be useful for mapping dominant markers if all loci in the recurrent parent 
are homozygous and the donor and recurrent parent have contrasting polymorphic marker 
alleles (Reiter et al., Proc. Natl. Acad. Sci. USA 89:1477-1481 (1992)). Information
obtained from backcross populations using either codominant or dominant markers is less
than that obtained from F2 populations because one, rather than two, recombinant 
gametes are sampled per plant. Backcross populations, however, are more informative (at 
low marker saturation) when compared to RILs as the distance between linked loci 
increases in RIL populations (i.e., about 0.15% recombination). Increased recombination 
can be beneficial for resolution of tight linkages, but may be undesirable in the 
construction of maps with low marker saturation. 

Near-isogenic lines (NIL) (created by many backcrosses to produce an array of
individuals that are nearly identical in genetic composition except for the trait or genomic
region under interrogation) can be used as a mapping population. In mapping with NILs,
only a portion of the polymorphic loci are expected to map to a selected region. 

Bulk segregant analysis (BSA) is a method developed for the rapid identification 
of linkage between markers and traits of interest (Michelmore et al., Proc. Natl. Acad.
Sci. USA 88:9828-9832 (1991)). In BSA, two bulked DNA samples are drawn from a
segregating population originating from a single cross. These bulks contain individuals 
that are identical for a particular trait (resistant or susceptible to particular disease) or 
genomic region but arbitrary at unlinked regions (i.e., heterozygous). Regions unlinked 
to the target region will not differ between the bulked samples of many individuals in 
BSA. 
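
As a non-limiting illustration of the BSA comparison described above, the Python sketch
below scores dominant markers as present or absent in each of two bulks and reports the
markers that differ between them. The marker names, the dictionary representation, and
the function name are hypothetical and are used only for this example.

```python
# Illustrative sketch: compare dominant-marker band scores between two bulks.
# A marker scored identically in both bulks is treated as unlinked to the target
# region; a marker present in exactly one bulk is a candidate linked marker.


def candidate_linked_markers(bulk_a, bulk_b):
    """Return the sorted marker names whose band is present in exactly one bulk."""
    markers = sorted(set(bulk_a) | set(bulk_b))
    return [m for m in markers if bulk_a.get(m, False) != bulk_b.get(m, False)]
```

Candidates identified in this way would ordinarily be confirmed by scoring the
individual plants of the segregating population.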

It is understood that one or more of the nucleic acid molecules of the present 
invention may in one embodiment be used as markers in genetic mapping. In a preferred
embodiment, nucleic acid molecules of the present invention may be used as markers
with rice.

Nucleic acid molecules of the present invention can be used in comparative 
mapping (physical and genetic) and to isolate molecules from other cereals based on the 
syntenic relationship between cereals. Comparative mapping within families provides a
method to assess the degree of sequence conservation, gene order, ploidy of species, ancestral
relationships and the rates at which individual genomes are evolving. Comparative 
mapping has been carried out by cross-hybridizing molecular markers across species 
within a given family. 

In a preferred embodiment, the nucleic acid molecules of the present invention 
can be utilized to isolate corresponding syntenic regions in non-rice plants (Bennetzen
and Freeling, Trends in Genet., 9(8):259-261 (1993); Ahn et al., Mol. Gen. Genet.,
241(5-6):483-490 (1993); Schwarzacher, Curr. Opin. Genet. & Devel., 4(6):868-874 (1994);
Kurata et al., Bio/Technology, 12:276-278 (1994); Kilian et al., Nucl. Acids Res.,
23(14):2729-2733 (1995); Bennett, Symp. Soc. Exp. Biol., 50:45-52 (1996); Hu et al.,
Genetics, 142(3):1021-1031 (1996); Kilian, Plant Mol. Biol., 35:187-195 (1997);
Bennetzen and Freeling, Genome Res., 7(4):301-306 (1997); Foote et al., Genetics,
147(2):801-807 (1997); Gallego et al., Genome, 41(3):328-336 (1998); Gale and Devos,
Proc. Natl. Acad. Sci. USA, 95:1971-1974 (1998); Bennetzen et al., Proc. Natl. Acad. Sci.
USA, 95:1975-1978 (1998); Messing and Llaca, Proc. Natl. Acad. Sci. USA, 95:2017-2020
(1998); McCouch, Proc. Natl. Acad. Sci. USA, 95:1983-1985 (1998); Goff, Curr.
Opin. Plant Biol., 2:85-89 (1999); Bailey et al., Theor. Appl. Genet., 98:281-284 (1999);
Zhang et al., Proc. Natl. Acad. Sci. USA, 91:8675-8679 (1994); Yano and Sasaki, Plant
Mol. Biol., 35:145-153 (1997); Leister et al., Proc. Natl. Acad. Sci. USA, 95:370-375
(1998); Lin et al., Phytopathology, 86(11):1156-1159 (1996); Havukkala, Curr. Opin.
Genet. Dev., 6:711-713 (1996); and Lee, The Society for Experimental Biology, pp. 31-38
(1996)). Synteny between rice and barley has recently been reported in the genomic
region carrying malting quality Quantitative Trait Loci (QTL) (Kleinhofs et al., Genome
41:373-380 (1998)). Likewise, mapping of the liguleless region of sorghum, a region
containing a developmental control gene, was facilitated using molecular markers from a
syntenic region of the rice genome (Christou et al., Genetics 148:1983-1992 (1998)).

In a particularly preferred embodiment, the nucleic acid molecules of the present 
invention that define a genomic region in rice plants associated with a desirable 
phenotype are utilized to obtain corresponding syntenic regions in non-rice plants. A
region can be defined either physically or genetically. In an even more preferred
embodiment, the nucleic acid molecules of the present invention that define a genomic 
region in rice plants associated with a desirable phenotype are utilized to obtain 
corresponding syntenic regions in corn plants. A region can be defined either physically 
or genetically. 

One or more of the nucleic acid molecules may be used to define a physical
genomic region. For example, two nucleic acid molecules of the present invention can 
act to define a physical genomic region that lies between them. Moreover, for example, a 
physical genomic region may be defined by a distance relative to a nucleic acid molecule. 
In a preferred embodiment of the present invention, the defined physical genomic region 
is less than about 1,000 kb, more preferably less than about 500 kb, even more preferably 
less than about 100 kb or less than about 50 kb. 

One or more of the nucleic acid molecules may be used to define a genomic
region by its genetic distance from one or more of the nucleic acid molecules of the 
present invention. In a preferred embodiment of the present invention, the genomic 
region is defined by its linkage to a nucleic acid molecule of the present invention. In 
such a preferred embodiment, the genomic region that is defined by one or more nucleic 
acid molecules of the present invention is located within about 50 centimorgans, more 
preferably within about 20 centimorgans, even more preferably within about 10, about 5 or
about 2 centimorgans of the trait or marker at issue. 
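
Genetic distances such as those recited above are expressed in centimorgans, whereas
mapping experiments yield observed recombination fractions. As an illustration only, and
as an assumption not stated in this disclosure, the Python sketch below converts a
recombination fraction to centimorgans using the standard Haldane and Kosambi map
functions and tests whether a locus falls within a chosen window. All function names are
hypothetical.

```python
# Illustrative sketch: convert a recombination fraction r (0 <= r < 0.5)
# to map distance in centimorgans using two standard map functions.
import math


def _check(r):
    if not 0.0 <= r < 0.5:
        raise ValueError("recombination fraction must be in [0, 0.5)")


def haldane_cm(r):
    """Haldane map function (assumes no crossover interference)."""
    _check(r)
    return -50.0 * math.log(1.0 - 2.0 * r)


def kosambi_cm(r):
    """Kosambi map function (allows for some crossover interference)."""
    _check(r)
    return 25.0 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def within_window(r, max_cm, map_function=haldane_cm):
    """True when the converted distance does not exceed max_cm."""
    return map_function(r) <= max_cm
```

For small recombination fractions both functions give values close to 100 times the
recombination fraction, so the choice of map function matters mainly for loosely linked loci.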

In another particularly preferred embodiment, two or more nucleic acid molecules 
of the present invention derived from rice plants that flank a genomic region of interest in 
rice plants are used to isolate the syntenic region in another cereal, more preferably
maize, sorghum, barley, or wheat. Regions of interest in rice include, without limitation,
those regions that are associated with a commercially desirable phenotype in rice. In 
another particularly preferred embodiment the desirable phenotype in rice is the result of 
a quantitative trait locus (QTL) present in the region. 

One exemplary approach to isolate syntenic genomic regions is as follows.
Nucleic acid sequences derived from rice of the present invention can be used to select
large insert clones from a total genomic DNA library of a related species such as maize, 
sorghum, barley, or wheat. Any appropriate method to screen the genomic library with a 
nucleic acid molecule of the present invention may be used to select the required clones 
(see, for example, Birren et al., Detecting Genes: A Laboratory Manual, Cold Spring
Harbor, New York, NY (1998)). For example, direct hybridization of a nucleic acid 
molecule of the present invention to mapping filters comprising the genomic DNA of the 
syntenic species can be used to select large insert clones from a total genomic DNA 
library of a related species. The selected clones can then be used to physically map the 
region in the target species. An advantage of this method for comparative mapping is that
no mapping population or linkage map of the target species is needed and the clones may 
also be used in other closely related species. By comparing the results obtained by 
genetic mapping in model plants, with those from other species, similarities of genomic 
structure among plant species can be established. Cross-hybridization of RFLP markers
has been reported and conserved gene order has been established in many studies. Such
macroscopic synteny is utilized for the estimation of correspondence of loci among these 
crops. These loci include not only Mendelian genes but also Quantitative Trait Loci 
(QTL) (Mohan et al., Molecular Breeding 3:87-103 (1997)). Other methods to isolate
syntenic nucleic acid molecules may be used.

It is understood that markers of the present invention may be used in comparative 
mapping. In a preferred embodiment the markers of the present invention may be used in the
comparative mapping of cereals, more preferably maize, barley, sorghum, and wheat.

It is understood that markers of the present invention may be used to isolate 
promoters and other nucleic acid sequences from other cereals based on the syntenic 
relationship between such cereals. In a preferred embodiment the cereal is selected from 
the group of maize, sorghum, barley, and wheat.

The nucleic acid molecules of the present invention can be used to identify 
polymorphisms. In one embodiment, one or more of the nucleic acid molecules may be 
employed as a marker nucleic acid molecule to identify such polymorphism(s). 
Alternatively, such polymorphisms can be detected through the use of a marker nucleic 
acid molecule or a marker protein that is genetically linked to (i.e., a polynucleotide that 
co-segregates with) such polymorphism(s). In a preferred embodiment, the plant is 
selected from the group consisting of cereals, and more preferably rice, maize, barley, 
sorghum, and wheat.

In an alternative embodiment, such polymorphisms can be detected through the 
use of a marker nucleic acid molecule that is physically linked to such polymorphism(s). 
For this purpose, marker nucleic acid sequences located within 1 Mb of the
polymorphism(s), and more preferably within 100 kb of the polymorphism(s), and most 
preferably within 10 kb of the polymorphism(s) can be employed. 

The genomes of animals and plants naturally undergo spontaneous mutation in the 
course of their continuing evolution (Gusella, Ann. Rev. Biochem. 55:831-854 (1986)). A
"polymorphism" is a variation or difference in the sequence of the gene or its flanking
regions that arises in some of the members of a species. The variant sequence and the 
"original" sequence co-exist in the species' population. In some instances, such co- 
existence is in stable or quasi-stable equilibrium. 

A polymorphism is thus said to be "allelic," in that, due to the existence of the 
polymorphism, some members may have the original sequence (i.e., the original "allele")
whereas other members may have the variant sequence (i.e., the variant "allele"). In the
simplest case, only one variant sequence may exist, and the polymorphism is thus said to 
be di-allelic. In other cases, the population may contain multiple alleles, and the 
polymorphism is termed tri-allelic, etc. A single gene may have multiple different 
unrelated polymorphisms. For example, it may have a di-allelic polymorphism at one 
site, and a multi-allelic polymorphism at another site. 

The variation that defines the polymorphism may range from a single nucleotide 
variation to the insertion or deletion of extended regions within a gene. In some cases, 
the DNA sequence variations are in regions of the genome that are characterized by short 
tandem repeats (STRs) that include tandem di- or tri-nucleotide repeated motifs of 
nucleotides. Polymorphisms characterized by such tandem repeats are referred to as 
"variable number tandem repeat" ("VNTR") polymorphisms. VNTRs have been used in 
identity analysis (Weber, U.S. Patent 5,075,217; Armour et al., FEBS Lett. 307:113-115
(1992); Jones et al., Eur. J. Haematol. 39:144-147 (1987); Horn et al., PCT Application
WO91/14003; Jeffreys, European Patent Application 370,719; Jeffreys, U.S. Patent
5,175,082; Jeffreys et al., Amer. J. Hum. Genet. 39:11-24 (1986); Jeffreys et al., Nature
316:76-79 (1985); Gray et al., Proc. R. Acad. Soc. Lond. 243:241-253 (1991); Moore et
al., Genomics 10:654-660 (1991); Jeffreys et al., Anim. Genet. 18:1-15 (1987); Hillel et
al., Anim. Genet. 20:145-155 (1989); Hillel et al., Genet. 124:783-789 (1990)).

The detection of polymorphic sites in a sample of DNA may be facilitated through
the use of nucleic acid amplification methods. Such methods specifically increase the
concentration of polynucleotides that span the polymorphic site, or include that site and
sequences located either distal or proximal to it. Such amplified molecules can be readily
detected by gel electrophoresis or other means.

The most preferred method of achieving such amplification employs the
polymerase chain reaction ("PCR") (Mullis et al., Cold Spring Harbor Symp. Quant.
Biol. 51:263-273 (1986); Erlich et al., European Patent Appln. 50,424; European Patent
Appln. 84,796, European Patent Application 258,017, European Patent Appln. 237,362;
Mullis, European Patent Appln. 201,184; Mullis et al., U.S. Patent No. 4,683,202;
Erlich, U.S. Patent No. 4,582,788; and Saiki et al., U.S. Patent No. 4,683,194), using
primer pairs that are capable of hybridizing to the proximal sequences that define a
polymorphism in its double-stranded form.
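
By way of example only, the region that a primer pair would be expected to amplify can
be located computationally before any reaction is run. The Python sketch below assumes
exact primer matches on a single linear template and ignores mismatches, melting
temperature and secondary structure; the function names and sequences are hypothetical.

```python
# Illustrative sketch: locate the product bounded by a forward primer and a
# reverse primer on a template, requiring exact matches.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def amplicon(template, forward, reverse):
    """Return the predicted product, or None if either primer fails to match.

    The forward primer is matched directly against the template strand; the
    reverse primer is matched through its reverse complement, downstream of
    the forward primer site.
    """
    start = template.find(forward)
    if start < 0:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end < 0:
        return None
    return template[start:end + len(reverse_site)]
```

A product whose predicted length differs between two alleles would indicate an insertion
or deletion polymorphism lying between the primer sites.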

In lieu of PCR, alternative methods, such as the "Ligase Chain Reaction" ("LCR")
may be used (Barany, Proc. Natl. Acad. Sci. USA 88:189-193 (1991)). LCR uses two
pairs of oligonucleotide probes to exponentially amplify a specific target. The sequences
of each pair of oligonucleotides are selected to permit the pair to hybridize to abutting
sequences of the same strand of the target. Such hybridization forms a substrate for a
template-dependent ligase. As with PCR, the resulting products thus serve as a template
in subsequent cycles and an exponential amplification of the desired sequence is
obtained.

LCR can be performed with oligonucleotides having the proximal and distal
sequences of the same strand of a polymorphic site. In one embodiment, either 
oligonucleotide will be designed to include the actual polymorphic site of the 
polymorphism. In such an embodiment, the reaction conditions are selected such that the 
oligonucleotides can be ligated together only if the target molecule either contains or 
lacks the specific nucleotide that is complementary to the polymorphic site present on the 
oligonucleotide. Alternatively, the oligonucleotides may be selected such that they do 
not include the polymorphic site (see Segev, PCT Application WO 90/01069).
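
The allele discrimination described above can be pictured with a simplified model in
which two oligonucleotides are joined only when they match abutting positions of the
target without mismatch. The Python sketch below is an illustrative assumption rather
than a description of actual ligase behavior; for simplicity the oligonucleotides are
written in the same orientation as the strand they match, and the names are hypothetical.

```python
# Illustrative sketch: model ligation-based allele discrimination.
# Ligation is predicted only when the upstream and downstream oligonucleotides
# match directly adjacent stretches of the target with no mismatch, including
# at the polymorphic base carried at the 3' end of the upstream oligonucleotide.


def ligates(target, upstream, downstream):
    """True when upstream and downstream match abutting sites on the target."""
    return (upstream + downstream) in target


def genotype_call(target, upstream_by_allele, downstream):
    """Return the sorted allele labels whose upstream oligonucleotide ligates."""
    return sorted(
        allele
        for allele, upstream in upstream_by_allele.items()
        if ligates(target, upstream, downstream)
    )
```

In this model a target carrying one allele supports ligation of only the matching
upstream oligonucleotide, which is the basis for scoring the polymorphic site.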

The "Oligonucleotide Ligation Assay" ("OLA") may alternatively be employed 
(Landegren et al., Science 241:1077-1080 (1988)). The OLA protocol uses two
oligonucleotides which are designed to be capable of hybridizing to abutting sequences of 
a single strand of a target. OLA, like LCR, is particularly suited for the detection of point 
mutations. Unlike LCR, however, OLA results in "linear" rather than exponential 
amplification of the target sequence. 

Nickerson et al. have described a nucleic acid detection assay that combines
attributes of PCR and OLA (Nickerson et al., Proc. Natl. Acad. Sci. USA 87:8923-8927
(1990)). In this method, PCR is used to achieve the exponential amplification of target 
DNA, which is then detected using OLA. In addition to requiring multiple, and separate, 
processing steps, one problem associated with such combinations is that they inherit all of 
the problems associated with PCR and OLA. 

Schemes based on ligation of two (or more) oligonucleotides in the presence of 
nucleic acid having the sequence of the resulting "di-oligonucleotide", thereby amplifying 
the di-oligonucleotide, are also known (Wu et al., Genomics 4:560 (1989)), and may be
readily adapted to the purposes of the present invention.

Other known nucleic acid amplification procedures, such as allele-specific 
oligomers, branched DNA technology, transcription-based amplification systems, or 
isothermal amplification methods may also be used to amplify and analyze such
polymorphisms (Malek et al., U.S. Patent 5,130,238; Davey et al., European Patent
Application 329,822; Schuster et al., U.S. Patent 5,169,766; Miller et al., PCT
Application WO 89/06700; Kwoh et al., Proc. Natl. Acad. Sci. USA 86:1173-1177
(1989); Gingeras et al., PCT Application WO 88/10315; Walker et al., Proc. Natl. Acad.
Sci. USA 89:392-396 (1992)).

The identification of a polymorphism can be determined in a variety of ways. By
correlating the presence or absence of it in a plant with the presence or absence of a
phenotype, it is possible to predict the phenotype of that plant. If a polymorphism creates
or destroys a restriction endonuclease cleavage site, or if it results in the loss or insertion
of DNA (e.g., a VNTR polymorphism), it will alter the size or profile of the DNA
fragments that are generated by digestion with that restriction endonuclease. As such,
individuals that possess a variant sequence can be distinguished from those having the
original sequence by restriction fragment analysis. Polymorphisms that can be identified
in this manner are termed "restriction fragment length polymorphisms" ("RFLPs").
RFLPs have been widely used in human and plant genetic analyses (Glassberg, UK
Patent Application 2135774; Skolnick et al., Cytogen. Cell Genet. 32:58-67 (1982);
Botstein et al., Am. J. Hum. Genet. 32:314-331 (1980); Fischer et al., PCT Application
WO90/13668; Uhlen, PCT Application WO90/11369).
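
As an illustration only, the effect of a polymorphism on a restriction fragment profile
can be simulated by cutting two sequences at every occurrence of a recognition site and
comparing the resulting fragment lengths. The Python sketch below assumes a linear
molecule and a palindromic recognition site; the sequences and function names are
hypothetical, and the example site GAATTC with a cut one base into the site corresponds
to the well-known EcoRI specificity.

```python
# Illustrative sketch: simulate a restriction digest and compare fragment
# length profiles of two alleles.


def digest(seq, site, cut_offset):
    """Return fragment lengths after cutting seq at each occurrence of site.

    cut_offset is the number of bases into the site at which the strand is cut.
    """
    cuts = []
    pos = seq.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = seq.find(site, pos + 1)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def shows_rflp(seq_a, seq_b, site, cut_offset):
    """True when the two sequences give different fragment length profiles."""
    return sorted(digest(seq_a, site, cut_offset)) != sorted(
        digest(seq_b, site, cut_offset)
    )
```

In practice the fragments would be resolved by electrophoresis and detected with a
probe, so only fragments hybridizing to the probe would be scored.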

Polymorphisms can also be identified by Single Strand Conformation
Polymorphism (SSCP) analysis. The SSCP technique is a method capable of identifying
most sequence variations in a single strand of DNA, typically between 150 and 250 
nucleotides in length (Elles, Methods in Molecular Medicine: Molecular Diagnosis of
Genetic Diseases, Humana Press (1996); Orita et al., Genomics 5:874-879 (1989)).
Under denaturing conditions a single strand of DNA will adopt a conformation that is 
uniquely dependent on its sequence conformation. This conformation usually will be 
different, even if only a single base is changed. Most conformations have been reported 
to alter the physical configuration or size sufficiently to be detectable by electrophoresis. 
A number of protocols have been described for SSCP including, but not limited to, Lee et
al., Anal. Biochem. 205:289-293 (1992); Suzuki et al., Anal. Biochem. 192:82-84 (1991);
Lo et al., Nucleic Acids Research 20:1005-1009 (1992); and Sarkar et al., Genomics
13:441-443 (1992). It is understood that one or more of the nucleic acids of the present
invention may be utilized as markers or probes to detect polymorphisms by SSCP
analysis.

Polymorphisms may also be found using a DNA fingerprinting technique called 
amplified fragment length polymorphism (AFLP), which is based on the selective PCR 
amplification of restriction fragments from a total digest of genomic DNA to profile that
DNA (Vos et al., Nucleic Acids Res. 23:4407-4414 (1995)). This method allows for the
specific co-amplification of high numbers of restriction fragments, which can be 
visualized by PCR without knowledge of the nucleic acid sequence. 

AFLP employs basically three steps. Initially, a sample of genomic DNA is cut 
with restriction enzymes and oligonucleotide adapters are ligated to the restriction 
fragments of the DNA. The restriction fragments are then amplified using PCR by using
the adapter and restriction sequence as target sites for primer annealing. The selective
amplification is achieved by the use of primers that extend into the restriction fragments,
amplifying only those fragments in which the primer extensions match the nucleotides
flanking the restriction sites. These amplified fragments are then visualized on a
denaturing polyacrylamide gel.

AFLP analysis has been performed on Salix (Beismann et al., Mol. Ecol. 6:989-993
(1997)), Acinetobacter (Janssen et al., Int. J. Syst. Bacteriol. 47:1179-1187 (1997)),
Aeromonas popoffi (Huys et al., Int. J. Syst. Bacteriol. 47:1165-1171 (1997)), rice
(McCouch et al., Plant Mol. Biol. 35:89-99 (1997); Nandi et al., Mol. Gen. Genet.
255:1-8 (1997); Cho et al., Genome 39:373-378 (1996)), barley (Hordeum vulgare)
(Simons et al., Genomics 44:61-70 (1997); Waugh et al., Mol. Gen. Genet. 255:311-321
(1997); Qi et al., Mol. Gen. Genet. 254:330-336 (1997); Becker et al., Mol. Gen. Genet.
249:65-73 (1995)), potato (Van der Voort et al., Mol. Gen. Genet. 255:438-447 (1997);
Meksem et al., Mol. Gen. Genet. 249:74-81 (1995)), Phytophthora infestans (Van der Lee
et al., Fungal Genet. Biol. 21:278-291 (1997)), Bacillus anthracis (Keim et al., J.
Bacteriol. 179:818-824 (1997)), Astragalus cremnophylax (Travis et al., Mol. Ecol.
5:735-745 (1996)), Arabidopsis (Cnops et al., Mol. Gen. Genet. 253:32-41 (1996)),
Escherichia coli (Lin et al., Nucleic Acids Res. 24:3649-3650 (1996)), Aeromonas (Huys
et al., Int. J. Syst. Bacteriol. 46:572-580 (1996)), nematode (Folkertsma et al., Mol. Plant
Microbe Interact. 9:41-54 (1996)), tomato (Thomas et al., Plant J. 8:185-194 (1995)),
and human (Latorra et al., PCR Methods Appl. 3:351-358 (1994)). AFLP analysis has
also been used for fingerprinting mRNA (Money et al., Nucleic Acids Res. 24:2616-2617
(1996); Bachem et al., Plant J. 9:745-753 (1996)). It is understood that one or more of
the promoter sequences of the present invention may be utilized as markers or probes to
detect polymorphisms by AFLP analysis for fingerprinting mRNA.

Polymorphisms may also be found using random amplified polymorphic DNA 
(RAPD) (Williams et al., Nucl. Acids Res. 18:6531-6535 (1990)) and cleavable
amplified polymorphic sequences (CAPS) (Lyamichev et al., Science 260:778-783
(1993)). It is understood that one or more of the promoter sequences of the present
invention may be utilized as markers or probes to detect polymorphisms by RAPD or
CAPS analysis. 

Promoter sequences of the present invention can be used in a microarray-based
method for high-throughput screening of plant genomic DNA. This 'chip'-based
approach involves using microarrays of nucleic acid molecules as gene-specific
hybridization targets to identify and quantitatively measure the corresponding plant genes
(Schena et al., Science 270:467-470 (1995); Shalon, Ph.D. Thesis. Stanford University
(1996)). Every nucleotide in a large sequence can be queried at the same time. 
Hybridization can be used to efficiently analyze nucleotide sequences. 

Several microarray methods have been described. For example, microarrays of 
BACs may be prepared to sufficiently cover 3X of an entire genome. Such microarrays 
can be used in a variety of genomics experiments including gene mapping, DNA 
fingerprinting and promoter identification. Microarrays of genomic DNA can also be 
used for parallel analysis of genomes at single gene resolution (Lemieux et al., Molecular
Breeding 277-289 (1988)). It is understood that one or more of the molecules of the
present invention, preferably one or more of the promoter sequences of the present
invention, may be utilized in a genomic microarray based method. In a preferred
embodiment of the present invention, one or more of the rice genomic promoter
sequences may be utilized in a genomic microarray based method. For example,
Genomic Mismatch Scanning (GMS), a hybridization-based method of linkage analysis 
that allows rapid identification of regions of identity-by-descent between two related
individuals, can be carried out with microarrays. GMS is reported to have been used to
identify genetically common chromosomal segments based on the ability of these DNA
sequences to form extensive regions of mismatch-free heteroduplexes. A series of
enzymatic steps, coupled with filter binding, is used to selectively remove heteroduplexes
that contain mismatches (i.e., chromosomal regions that do not share
identity-by-descent). Fragments of chromosomal DNA representing inherited regions are
hybridized to a microarray of ordered genomic clones and positive hybridization signals
pinpoint regions of identity-by-descent at high resolution (Lemieux et al., Molecular
Breeding 277-289 (1988)).

It is understood that one or more of the nucleic acid molecules of the present
invention may be utilized in a GMS microarray based method to locate regions of
identity-by-descent between related individuals. In a preferred embodiment of the
present invention, one or more of the nucleic acid molecules of the present invention may
be utilized in a GMS microarray based method to locate regions of identity-by-descent
between related individuals.

A particularly preferred microarray embodiment of the present invention is a
microarray comprising nucleic acid molecules that are homologues of known sequences
but elicit only limited or no matches to known nucleic acid molecules. A further
preferred microarray embodiment of the present invention is a microarray comprising
genomic nucleic acid molecules of the present invention that elicit only limited or no
matches to known genes.

It is understood that one or more of the molecules of the present invention,
preferably one or more of the promoter sequences of the present invention, may be
utilized in a microarray based method.

In a preferred embodiment of the present invention, one or more of the nucleic
acid molecules of the present invention may be utilized in a microarray based method.

Computer related Uses of the Invention

A nucleic acid molecule comprising SEQ ID NO:1 through SEQ ID NO:57,467,
complements thereof and fragments of either, or a nucleic acid molecule that hybridizes
under stringent conditions with SEQ ID NO:1 through SEQ ID NO:57,467, or any
complement thereof; or exhibits 85% or greater identity, and more preferably at least 86
or greater, 87 or greater, 88 or greater, 89 or greater, 90 or greater, 91 or greater, 92 or
greater, 93 or greater, 94 or greater, 95 or greater, 96 or greater, 97 or greater, 98 or
greater, or 99% or greater identity to SEQ ID NO:57,467; can be "provided" in a variety
of mediums to facilitate its use. Such a medium can also provide a subset thereof in a form
that allows a skilled artisan to examine the sequences.
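
For illustration only, a percent identity figure of the kind recited above can be
computed from two sequences that have already been aligned. The Python sketch below
assumes a pre-computed pairwise alignment in which gaps are written as "-" and counts
identical positions over the alignment columns; this disclosure does not prescribe this
particular calculation, and the function names and threshold default are hypothetical.

```python
# Illustrative sketch: percent identity over the columns of a pairwise alignment.


def percent_identity(aligned_a, aligned_b):
    """Return the percentage of alignment columns holding identical residues.

    Columns in which both sequences carry a gap are ignored; a column with a
    gap in only one sequence counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [
        (a, b) for a, b in zip(aligned_a, aligned_b) if not (a == "-" and b == "-")
    ]
    if not columns:
        raise ValueError("alignment has no scorable columns")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


def meets_identity(aligned_a, aligned_b, threshold=85.0):
    """True when the percent identity is at least the threshold."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Different alignment programs and gap treatments can yield different identity values for
the same pair of sequences, so the convention used should be stated with any result.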

In a preferred embodiment, at least 20, 50, 100, 500, 1,000, 2,000, 3,000, or 4,000 
of the nucleic acid sequences of the present invention are provided in a variety of 
mediums. In one application of this embodiment, a nucleotide sequence of the present
invention can be recorded on computer readable media. As used herein, "computer 
readable media" refers to any medium that can be read and accessed directly by a 
computer. Such media include, but are not limited to: magnetic storage media, such as
floppy discs, hard disc storage medium, and magnetic tape; optical storage media such
as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these
categories such as magnetic/optical storage media. A skilled artisan can readily 
appreciate how any of the presently known computer readable mediums can be used to 
create a manufacture comprising computer readable medium having recorded thereon a 
nucleotide sequence of the present invention. 

As used herein, "recorded" refers to a process for storing information on computer 
readable medium. A skilled artisan can readily adopt any of the presently known 
methods for recording information on computer readable medium to generate media 
comprising the nucleotide sequence information of the present invention. A variety of 
data storage structures are available to a skilled artisan for creating a computer readable 
medium having recorded thereon a nucleotide sequence of the present invention. The 
choice of the data storage structure will generally be based on the means chosen to access 
the stored information. In addition, a variety of data processor programs and formats can 
be used to store the nucleotide sequence information of the present invention on computer 
readable medium. The sequence information can be represented in a word processing 
text file, formatted in commercially-available software such as WordPerfect and 
Microsoft Word, or represented in the form of an ASCII file, stored in a database 
application, such as DB2, Sybase, Oracle, or the like. A skilled artisan can readily adapt 
any number of data processor structuring formats (e.g., text file or database) in order to 
obtain computer readable medium having recorded thereon the nucleotide sequence 
information of the present invention. 
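
As a minimal sketch of recording sequence information in a database application, the
Python example below stores sequence records in a relational table and retrieves one by
identifier. It uses the SQLite engine from the Python standard library as a stand-in for
the database applications named above; the table layout, column names, record
identifiers and function names are assumptions made for this example.

```python
# Illustrative sketch: store and retrieve nucleotide sequences in a small
# relational database held in memory.
import sqlite3


def build_sequence_store(records):
    """Create an in-memory database holding (seq_id, residues) records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sequences (seq_id TEXT PRIMARY KEY, residues TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO sequences VALUES (?, ?)", records)
    conn.commit()
    return conn


def fetch_sequence(conn, seq_id):
    """Return the residues stored under seq_id, or None if it is absent."""
    row = conn.execute(
        "SELECT residues FROM sequences WHERE seq_id = ?", (seq_id,)
    ).fetchone()
    return row[0] if row else None
```

Keying each record on a sequence identifier allows search software to retrieve
individual sequences without reading the entire listing.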

By providing one or more of the nucleotide sequences of the present invention, a
skilled artisan can routinely access the sequence information for a variety of purposes.
Computer software is publicly available which allows a skilled artisan to access sequence 
information provided in a computer readable medium. The examples which follow 
demonstrate how software which implements the BLAST (Altschul et al., J. Mol. Biol.
215:403-410 (1990)) and BLAZE (Brutlag et al., Comp. Chem. 17:203-207 (1993))
search algorithms on a Sybase system can be used to identify open reading frames 
(ORFs) within the genome that contain homology to ORFs or proteins from other 
organisms. Such ORFs are protein-encoding fragments within the sequences of the 
present invention and are useful in producing commercially important proteins such as 
enzymes used in amino acid biosynthesis, metabolism, transcription, translation, RNA 
processing, nucleic acid and protein degradation, protein modification, and DNA
replication, restriction, modification, recombination, and repair. 

The present invention further provides systems, particularly computer-based 
systems, which contain the sequence information described herein. Such systems are 
designed to identify commercially important fragments of the nucleic acid molecule of 
the present invention. As used herein, "a computer-based system" refers to the hardware 
means, software means, and data storage means used to analyze the nucleotide sequence 
information of the present invention. The minimum hardware means of the computer- 
based systems of the present invention comprises a central processing unit (CPU), input 
means, output means, and data storage means. A skilled artisan can readily appreciate 
that any one of the currently available computer-based systems is suitable for use in the 
present invention. 

As indicated above, the computer-based systems of the present invention 
comprise a data storage means having stored therein a nucleotide sequence of the present 
invention and the necessary hardware means and software means for supporting and 
implementing a search means. As used herein, "data storage means" refers to memory 
that can store nucleotide sequence information of the present invention, or a memory 
access means which can access manufactures having recorded thereon the nucleotide 
sequence information of the present invention. As used herein, "search means" refers to 
one or more programs which are implemented on the computer-based system to compare 
a target sequence or target structural motif with the sequence information stored within 
the data storage means. Search means are used to identify fragments or regions of the 
sequence of the present invention that match a particular target sequence or target motif. 
A variety of known algorithms are disclosed publicly and a variety of commercially 
available software for conducting search means is available and can be used in the 
computer-based systems of the present invention. Examples of such software include, but 
are not limited to, MacPattern (EMBL), BLASTN and BLASTX (NCBI). Any one of the 
available algorithms or implementing software packages for conducting homology 
searches can be adapted for use in the present computer-based systems. 

As used herein, "a target structural motif," or "target motif," refers to any 
rationally selected sequence or combination of sequences in which the sequence(s) are 
chosen based on primary sequence composition or a three dimensional configuration 
which is formed upon folding of the target motif. There are a variety of target motifs 
known in the art. Target motifs include, but are not limited to, transcription factor 
binding sites, repressor binding sites, inducible expression elements, transcriptional 
activation sites, transcription initiation sites, untranslated leaders, intron splicing sites, 
methylation sites, histone binding sites, RNA processing sites, non-histone structural 
protein binding sites, replication sites, sites which influence the stability of transcribed 
mRNA message, and hairpin sites. 

Thus, the present invention further provides an input means for receiving a target 
sequence, a data storage means for storing the sequences of the present invention 
identified using a search means as described above, and an output means for 
outputting the identified homologous sequences. A variety of structural formats for the 
input and output means can be used to input and output information in the computer- 
based systems of the present invention. A preferred format for an output means ranks 
fragments of the sequence of the present invention by varying degrees of homology to the 
target sequence or target motif. Such presentation provides a skilled artisan with a 
ranking of sequences which contain various amounts of the target sequence or target 
motif and identifies the degree of homology contained in the identified fragment. 

A variety of comparing means can be used to compare a target sequence or target 
motif with the data storage means to identify sequence fragments of the present 
invention. For example, software which implements the BLAST and BLAZE 
algorithms (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) can be used to identify 
open reading frames within the nucleic acid molecules of the present invention. A skilled 
artisan can readily recognize that any one of the publicly available homology search 
programs can be used as the search means for the computer-based systems of the present 
invention. 

Having now generally described the invention, the same will be more readily 
understood through reference to the following examples which are provided by way of 
illustration, and are not intended to be limiting of the present invention, unless specified. 

Each periodical, patent, and other document or reference cited herein is herein 
incorporated by reference in its entirety. 

EXAMPLES 

The following examples are provided to better illustrate the practice of the present 
invention and should not be interpreted in any way to limit the scope of the present 
invention. Those skilled in the art will recognize that various modifications, truncations, 
etcetera can be made to the methods and genes described herein while not departing from 
the spirit and scope of the present invention. 

Example 1: Generating a genomic bacterial artificial chromosome (BAC) library 

BACs are stable, non-chimeric cloning systems having genomic fragment inserts 
(100-300 kb) and their DNA can be prepared for most types of experiments including 
DNA sequencing. The BAC vector pBeloBAC11 is derived from the endogenous E. coli F- 
factor plasmid, which contains genes for strict copy number control and unidirectional 
origin of DNA replication. Additionally, pBeloBAC11 has three unique restriction 
enzyme sites (Hind III, Bam HI and Sph I) located within the LacZ gene which can be 
used as cloning sites for megabase-size plant DNA. Indigo, another BAC vector, contains 
Hind III and Eco RI cloning sites. This vector also contains a random mutation in the 
LacZ gene that allows for darker blue colonies. 

As an alternative, the P1-derived artificial chromosome (PAC) can be used as a 
large DNA fragment cloning vector (Ioannou et al., Nature Genet. 6:84-89 (1994); 
Suzuki et al., Gene 199:133-137 (1997)). The PAC vector has most of the features of the 
BAC system, but also contains some of the elements of the bacteriophage P1 cloning 
system. 

BAC libraries are generated by ligating size-selected restriction digested DNA 
with pBeloBAC11 followed by electroporation into E. coli. BAC library construction 
and characterization is extremely efficient when compared to YAC (yeast artificial 
chromosome) library construction and analysis, particularly because of the chimerism 
associated with YACs and difficulties associated with extracting YAC DNA. 

There are general methods for preparing megabase-size DNA from plants. For 
example, the protoplast method yields megabase-size DNA of high quality with minimal 
breakage. A process involves preparing young leaves which are manually feathered with 
a razor-blade before being incubated for four to five hours with cell-wall-degrading 
enzymes. A second method, developed by Zhang et al., Plant J. 7:175-184 (1995), is a 
universal nuclei method that works well for several divergent plant taxa. Fresh or frozen 
tissue is homogenized with a blender or mortar and pestle. Nuclei are then isolated and 
embedded. DNA prepared by the nuclei method is often more concentrated and is 
reported to contain lower amounts of chloroplast DNA than the protoplast method. 

Once protoplasts or nuclei are produced, they are embedded in an agarose matrix 
as plugs or microbeads. The agarose provides a support matrix to prevent shearing of the 
DNA while allowing enzymes and buffers to diffuse into the DNA. The DNA is purified 
and manipulated in the agarose and is stable for more than one year at 4°C. 

Once high molecular weight DNA is prepared, it is fragmented to the desired size 
range. In general, DNA fragmentation uses one of two approaches: 1) physical 
shearing and 2) partial digestion with a restriction enzyme that cuts relatively frequently 
within the genome. Since physical shearing is not dependent upon the frequency and 
distribution of particular restriction enzyme sites, this method should yield the most 
random distribution of DNA fragments. However, the ends of the sheared DNA 
fragments must be repaired and cloned directly or restriction enzyme sites added by the 
addition of synthetic linkers. Because of the subsequent steps required to clone DNA 
fragmented by shearing, most protocols fragment DNA by partial restriction enzyme 
digestion. The advantage of partial restriction enzyme digestion is that no further 
enzymatic modification of the ends of the restriction fragments is necessary. Four 
common techniques that can be used to achieve reproducible partial digestion of 
megabase-size DNA are 1) varying the concentration of the restriction enzyme, 2) 
varying the time of incubation with the restriction enzyme, 3) varying the concentration of 
an enzyme cofactor (e.g., Mg2+) and 4) varying the ratio of endonuclease to methylase. 

There are three cloning sites in pBeloBAC11, but only Hind III and Bam HI 
produce 5' overhangs for easy vector dephosphorylation. These two restriction enzymes 
are primarily used to construct BAC libraries. The optimal partial digestion conditions 
for megabase-size DNA are determined by wide and narrow window digestions. To 
determine the optimum amount of Hind III, 1, 2, 3, 10, and 50 units of enzyme are each 
added to 50 ml aliquots of microbeads and incubated at 37 °C for 20 minutes. 

After partial digestion of megabase-size DNA, the DNA is run on a pulsed-field 
gel, and DNA in a size range of 100-500 kb is excised from the gel. This DNA is ligated 
to the BAC vector or subjected to a second size selection on a pulsed-field gel under 
different running conditions. Studies have previously reported that two rounds of size 
selection can eliminate small DNA fragments co-migrating with the selected range in the 
first pulsed-field fractionation. Such a strategy results in an increase in insert sizes and a 
more uniform insert size distribution. A practical approach to performing size selections 
is to first test for the number of clones/microliter of ligation and insert size from the first 
size selected material. If the numbers are good (500 to 2000 white colonies/microliter of 
ligation) and the size range is also good (50 to 300 kb) then a second size selection is 
practical. When performing a second size selection one expects an 80 to 95% decrease in 
the number of recombinant clones per transformation. 

Twenty to two hundred nanograms of the size-selected DNA is ligated to 
dephosphorylated BAC vector (molar ratio of 10 to 1 in BAC vector excess). Most BAC 
libraries use a molar ratio of 5 to 15 : 1 (size selected DNA:BAC vector). 

Transformation is carried out by electroporation and the transformation efficiency 
for BACs is about 40 to 1,500 transformants from one microliter of ligation product or 20 
to 1000 transformants/ng DNA. 

Several tests can be carried out to determine the quality of a BAC library. Three 
basic tests to evaluate the quality, including the genome coverage of a BAC library, are: 
average insert size, average number of clones hybridizing with single copy probes, and 
chloroplast DNA content. 

The determination of the average insert size of the library is assessed in two ways. 
First, during library construction every ligation is tested to determine the average insert 
size by assaying 20-50 BAC clones per ligation. DNA is isolated from recombinant 
clones using a standard mini preparation protocol, digested with Not I to free the insert 
from the BAC vector and then sized using pulsed field gel electrophoresis (Maule, 
Molecular Biotechnology 9:107-126 (1998)). 

To determine the genome coverage of the library, it is screened with single copy 
RFLP markers distributed randomly across the genome by hybridization. Microtiter 
plates containing BAC clones are spotted onto Hybond membranes. Bacteria from 48 or 
72 plates are spotted twice onto one membrane resulting in 18,000 to 27,648 unique 
clones on each membrane in either a 4X4 or 5X5 orientation. Since each clone is present 
twice, false positives are easily eliminated and true positives are easily recognized and 
identified. 

Finally, the chloroplast DNA content in the BAC library is estimated by 
hybridizing three chloroplast genes spaced evenly across the chloroplast genome to the 
library on high density hybridization filters. 

There are strategies for isolating rare sequences within the genome. For example, 
higher plant genomes can range in size from 100 Mb/1C (Arabidopsis) to 15,966 Mb/C 
(Triticum aestivum) (Arumuganathan and Earle, Plant Mol. Biol. Rep. 9:208-219 (1991)). 
The number of clones required to achieve a given probability that any DNA sequence will 
be represented in a genomic library is N = ln(1-P)/ln(1-L/G), where N is the number of 
clones required, P is the probability desired to get the target sequence, L is the length of 
the average clone insert in base pairs and G is the haploid genome length in base pairs 
(Clarke et al., Cell 9:91-100 (1976)). 
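
By way of illustration only, the clone-number relationship above can be sketched in 
Python as follows. The function name, the rounding up to a whole clone, and the example 
insert length and genome size are assumptions made for this sketch and are not values 
taken from the present disclosure. 

```python
import math


def clones_required(p: float, insert_bp: float, genome_bp: float) -> int:
    """Return N = ln(1 - P) / ln(1 - L/G), rounded up to a whole clone."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if not 0.0 < insert_bp < genome_bp:
        raise ValueError("insert length must be positive and below genome length")
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - insert_bp / genome_bp))


# Hypothetical inputs: 150 kb average insert, 430 Mb haploid genome, P = 0.99.
n_clones = clones_required(0.99, 150_000, 430_000_000)
print(n_clones)
```

Because N grows with ln(1-P), raising the desired probability from 0.95 to 0.99 increases 
the required number of clones by roughly half, while doubling the average insert length 
roughly halves it. 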

The rice BAC library of the present invention is constructed in the pBeloBAC11 
or similar vector. Inserts are generated by partial Eco RI or other enzymatic digestion of 
DNA. The 25X library provides 4-5X sequence coverage from BAC clones across the 
genome. 

Example 2: Sequencing genomic DNA inserts from a genomic BAC Library 

Two basic methods can be used for DNA sequencing, the chain termination 
method of Sanger et al., Proc. Natl. Acad. Sci. USA 74:5463-5467 (1977), and the 
chemical degradation method of Maxam and Gilbert, Proc. Natl. Acad. Sci. USA 74:560- 
564 (1977). Automation and advances in technology such as the replacement of 
radioisotopes with fluorescence-based sequencing have reduced the effort required to 
sequence DNA (Craxton, Methods 2:20-26 (1991); Ju et al., Proc. Natl. Acad. Sci. USA 
92:4347-4351 (1995); Tabor and Richardson, Proc. Natl. Acad. Sci. USA 92:6339-6343 
(1995)). Automated sequencers are available from, for example, Pharmacia Biotech, Inc., 
Piscataway, New Jersey (Pharmacia ALF), LI-COR, Inc., Lincoln, Nebraska (LI-COR 
4,000) and Millipore, Bedford, Massachusetts (Millipore BaseStation). 

In addition, advances in capillary gel electrophoresis have also reduced the effort 
required to sequence DNA and such advances provide a rapid high resolution approach 
for sequencing DNA samples (Swerdlow and Gesteland, Nucleic Acids Res. 18:1415- 
1419 (1990); Smith, Nature 349:812-813 (1991); Luckey et al., Methods Enzymol. 
218:154-172 (1993); Lu et al., J. Chromatog. A. 680:497-501 (1994); Carson et al., Anal. 
Chem. 65:3219-3226 (1993); Huang et al., Anal. Chem. 64:2149-2154 (1992); Kheterpal 
et al., Electrophoresis 17:1852-1859 (1996); Quesada and Zhang, Electrophoresis 
17:1841-1851 (1996); Baba, Yakugaku Zasshi 117:265-281 (1997)). 

A number of sequencing techniques are known in the art, including fluorescence- 
based sequencing methodologies. These methods have the detection, automation and 
instrumentation capability necessary for the analysis of large volumes of sequence data. 
Currently, the 377 DNA Sequencer (Perkin-Elmer Corp., Applied Biosystems Div., 
Foster City, CA) allows the most rapid electrophoresis and data collection. With these 
types of automated systems, fluorescent dye-labeled sequence reaction products are 
detected and data entered directly into the computer, producing a chromatogram that is 
subsequently viewed, stored, and analyzed using the corresponding software programs. 
These methods are known to those of skill in the art and have been described and 
reviewed (Birren et al., Genome Analysis: Analyzing DNA, 1, Cold Spring Harbor, New 
York (1999)). 

PHRED is used to call the bases from the sequence trace files 
(http://www.mbt.washington.edu). PHRED uses Fourier methods to examine the four base 
traces in the region surrounding each point in the data set in order to predict a series of 
evenly spaced predicted locations. That is, it determines where the peaks would be 
centered if there were no compressions, dropouts, or other factors shifting the peaks from 
their "true" locations. Next, PHRED examines each trace to find the centers of the actual, 
or observed peaks and the areas of these peaks relative to their neighbors. The peaks are 
detected independently along each of the four traces so many peaks overlap. A dynamic 
programming algorithm is used to match the observed peaks detected in the second step 
with the predicted peak locations found in the first step. 

After the base calling is completed, contaminating sequences (E. coli, BAC vector 
sequences > 50 bases and sub-cloning vector) are removed and constraints are made for 
the assembler. Contigs are assembled using CAP3 (Huang et al., Genomics 46:37-45 
(1997)). 

A two-step re-assembly process is employed to reduce sequence redundancies 
caused by overlaps between BAC clones. In the first step, BAC clones are grouped into 
clusters based on overlaps between contig sequences from different BACs. These 
overlaps are identified by comparing each sequence in the dataset against every other 
sequence by BLASTN. BACs containing overlaps greater than 5,000 base pairs in 
length and greater than 94% in sequence identity are put into the same cluster. Repetitive 
sequences are masked prior to this procedure to avoid false joining by repetitive elements 
present in the genome. In the second step, sequences from each BAC cluster are 
assembled by PHRAP.longread, which is able to handle very long sequences. A 
minimum match is set at 100 bp and a minimum score is set at 600 as a threshold to join 
input contigs into longer contigs. 
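
The first step of the re-assembly process, grouping BAC clones into clusters, can be 
sketched in Python as shown below. This is a minimal illustration that assumes the 
overlaps have already been computed (e.g., parsed from BLASTN output) and that 
clustering is transitive, so that a chain of qualifying overlaps places all linked BACs in 
one cluster. The function name, the tuple layout and the BAC identifiers are hypothetical. 

```python
def cluster_bacs(bacs, overlaps, min_len=5000, min_identity=94.0):
    """Group BACs into clusters linked by qualifying overlaps.

    overlaps: iterable of (bac_a, bac_b, overlap_length_bp, percent_identity).
    """
    parent = {bac: bac for bac in bacs}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for bac_a, bac_b, length, identity in overlaps:
        # Both thresholds are strict ("greater than"), as in the text above.
        if length > min_len and identity > min_identity:
            parent[find(bac_a)] = find(bac_b)

    clusters = {}
    for bac in bacs:
        clusters.setdefault(find(bac), set()).add(bac)
    return sorted(clusters.values(), key=sorted)


bacs = ["BAC1", "BAC2", "BAC3", "BAC4"]
overlaps = [
    ("BAC1", "BAC2", 12000, 99.1),  # qualifies
    ("BAC2", "BAC3", 8000, 96.0),   # qualifies
    ("BAC3", "BAC4", 4000, 99.9),   # overlap too short
]
clusters = cluster_bacs(bacs, overlaps)
print(clusters)
```

In this example BAC1, BAC2 and BAC3 fall into one cluster and BAC4 remains alone. 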

Example 3: Identifying genes within a genomic BAC library 

This example illustrates the identification of combigenes within the rice genomic 
contig library as assembled in Example 2. The genes and partial genes that are 
embedded in such contigs are identified through a series of informatic analyses. The 
tools to define genes fall into two categories: homology-based and predictive-based 
methods. Homology-based searches (e.g., GAP2, BLASTX supplemented by NAP and 
TBLASTX) detect conserved sequences during comparisons of DNA sequences or 
hypothetically translated protein sequences to public and/or proprietary DNA and protein 
databases. Existence of an Oryza sativa gene is inferred if significant sequence similarity 
extends over the majority of the target gene. Since homology-based methods may 
overlook genes unique to Oryza sativa, for which homologous nucleic acid molecules 
have not yet been identified in databases, gene prediction programs are also used. 
Predictive methods employed in the definition of the Oryza sativa genes included the use 
of the GenScan gene predictive software program which is available from Stanford 
University (e.g., at the website: gnomic.stanford.edu/GENSCANW.html) and the 
GeneMark.hmm for Eukaryotes program from Gene Probe, Inc. (Atlanta, GA) 
(www.geneprobe.net/index.htm). GenScan, in general terms, infers the presence and 
extent of a gene through a search for "gene-like" grammar. GeneMark.hmm searches a 
file containing DNA sequence data for genes. It employs a Hidden Markov Model 
algorithm with a species-specific inhomogeneous Markov model of gene-encoding 
regions of DNA. 

The homology-based methods that are used to define the Oryza sativa gene set 
include GAP2, BLASTX supplemented by NAP and TBLASTX. For a description of 
BLASTX and TBLASTX see Coulson, Trends in Biotechnology 12:76-80 (1994) and 
Birren et al., Genome Analysis, 1:543-559 (1997). GAP2 and NAP are part of the 
Analysis and Annotation Tool (AAT) for Finding Genes in Genomic Sequences which 
was developed by Xiaoqiu Huang at Michigan Tech University and is available at the 
web site genome.cs.mtu.edu/. The AAT package includes two sets of programs, one set 
DPS/NAP (referred to as "NAP") for comparing the query sequence with a protein 
database, and the other set DDS/GAP2 (referred to as "GAP2") for comparing the query 
sequence with a cDNA database. Each set contains a fast database search program and a 
rigorous alignment program. The database search program identifies regions of the query 
sequence that are similar to a database sequence. Then the alignment program constructs 
an optimal alignment for each region and the database sequence. The alignment program 
also reports the coordinates of exons in the query sequence. See Huang et al., Genomics 
46:37-45 (1997). The GAP2 program computes an optimal global alignment of a 
genomic sequence and a cDNA sequence without penalizing terminal gaps. A long gap 
in the cDNA sequence is given a constant penalty. The DNA-DNA alignment by GAP2 
adjusts penalties to accommodate introns. The GAP2 program makes use of splice site 
consensuses in alignment computation. GAP2 delivers the alignment in linear space, so 
long sequences can be aligned. See Huang, Computer Applications in the Biosciences 
10:227-235 (1994). The GAP2 program aligns the Oryza sativa contigs with a library of 
42,260 Oryza sativa cDNAs. 

The NAP program computes a global alignment of a DNA sequence and a protein 
sequence without penalizing terminal gaps. NAP handles frameshifts and long introns in 
the DNA sequence. The program delivers the alignment in linear space, so long 
sequences can be aligned. It makes use of splice site consensuses in alignment 
computation. Both strands of the DNA sequence are compared with the protein sequence 
and the one of the two alignments with the larger score is reported. See Huang and Zhang, 
Computer Applications in the Biosciences 12(6):497-506 (1996). 

NAP takes a nucleotide sequence, translates it in three forward reading frames and 
three reverse complement reading frames, and then compares the six translations against a 
protein sequence database (e.g., the non-redundant protein (i.e., nr-aa) database maintained 
by the National Center for Biotechnology Information as part of GenBank and available 
at the web site: www.ncbi.nlm.nih.gov). 
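
The six-frame translation described above can be illustrated with the following Python 
sketch. It is not the NAP implementation; it assumes the standard genetic code, 
upper-case A/C/G/T input, and hypothetical function names, and it reports any codon 
containing another character as "X". 

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(seq: str) -> dict:
    """Return translations keyed '+1'..'+3' (forward) and '-1'..'-3' (reverse)."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
print(frames)
```

Each of the six translations would then be compared against the protein database by the 
search program. 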

The first homology-based search for genes in the Oryza sativa contigs is effected 
using the GAP2 program and the Oryza sativa library of clustered Oryza sativa cDNA. 
The Oryza sativa clusters are mapped onto an assembly of Oryza sativa contigs using the 
GAP2 program. GAP2 standards for selecting a DNA-DNA match are > 92% sequence 
identity with the following parameters: 

gap extension penalty = 1 
match score = 2 
gap open penalty = 6 
gap length for constant penalty = 20 
mismatch penalty = -2 
minimum exon length = 21 
minimum total length of all exons in a gene (in nucleotides) = 200 

When a particular Oryza sativa cDNA aligns to more than one Oryza sativa 
contig, the alignment with the highest identity is selected and alignments with lower 
levels of identity are filtered out as spurious alignments. Oryza sativa cDNA 
sequences aligning to Oryza sativa contigs with exceptionally low complexity are filtered 
out when the basis for alignment includes a high number of cDNAs with poly A tails 
aligning to genomic regions with extended repeats of A or T. 

The second homology-based method used for gene discovery is BLASTX hits 
extended with the NAP software package. BLASTX is run with the Oryza sativa 
genomic contigs as queries against the GenBank non-redundant protein data library 
identified as "nr-aa". NAP is used to better align the amino acid sequences as compared 
to the genomic sequence. NAP extends the match in regions where BLASTX has 
identified high-scoring-pairs (HSPs), predicts introns, and then links the exons into a 
single ORF prediction. Experience suggests that NAP tends to mis-predict the first exon. 
The NAP parameters are: 

gap extension penalty = 1 
gap open penalty = 15 
gap length for constant penalty = 25 
min exon length (in aa) = 7 
minimum total length of all exons in a gene (in nucleotides) = 200 
homology > 40% 

The NAP alignment score and GenBank reference number for best match are 
reported for each contig for which there is a NAP hit. 

In the final homology-based method, TBLASTX is used with cDNA information 
from four plant sequencing projects: 27,037 sequences from Triticum aestivum, 136,074 
sequences from Glycine max, 71,822 sequences from Zea mays and 68,517 sequences 
from Arabidopsis thaliana. Conservative standards for inclusion of TBLASTX hits into 
the gene set are utilized. These standards are an E value of 1E-16 or lower, and a minimal 
match of 150 bp in the Oryza sativa contig. 
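
A minimal Python sketch of applying these inclusion standards to a list of TBLASTX hits 
is given below. The record layout, field names and example hits are hypothetical, and the 
sketch assumes that both thresholds are inclusive (an E value equal to 1E-16 and a match 
of exactly 150 bp are accepted). 

```python
from typing import NamedTuple


class TblastxHit(NamedTuple):
    contig_id: str
    subject_id: str
    e_value: float
    contig_start: int  # 1-based, inclusive
    contig_end: int    # 1-based, inclusive; may be smaller than contig_start


def passes_standards(hit, max_e_value=1e-16, min_match_bp=150):
    """Apply the E value and match-length standards to one hit."""
    match_bp = abs(hit.contig_end - hit.contig_start) + 1
    return hit.e_value <= max_e_value and match_bp >= min_match_bp


hits = [
    TblastxHit("contig_1", "est_A", 1e-30, 1000, 1299),  # 300 bp, kept
    TblastxHit("contig_1", "est_B", 1e-10, 2000, 2299),  # E value too high
    TblastxHit("contig_2", "est_C", 1e-20, 500, 599),    # only 100 bp
]
kept = [hit for hit in hits if passes_standards(hit)]
print([hit.subject_id for hit in kept])
```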

The GenScan program is "trained" with Arabidopsis thaliana characteristics. 
Though better than the "off-the-shelf" version, the GenScan trained to identify Oryza 
sativa genes proved more proficient at predicting exons than predicting full-length genes. 
Predicting full-length genes is compromised by point mutations in the unfinished contigs, 
as well as by the short length of the contigs relative to the typical length of a gene. Due 
to the errors found in the full-length gene predictions by GenScan, inclusion of GenScan- 
predicted genes is limited to those genes and exons whose probabilities are above a 
conservative probability threshold. The GenScan parameters are: 

weighted mean GenScan P value > 0.4 
mean GenScan T value > 0 
mean GenScan Coding score > 50 
length > 200 bp 
minimum total length of all exons in a gene = 500 

The weighted mean GenScan P value is a probability for correctly predicting 
ORFs or partial ORFs and is defined as (1/Σ l_i)(Σ l_i P_i), where "l_i" is the length of 
exon i and "P_i" is the probability of correctness for that exon. 
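
As an illustration, the length-weighted mean can be computed as in the Python sketch 
below. The function name and the example exon lengths and probabilities are hypothetical 
and serve only to show the arithmetic. 

```python
def weighted_mean_p(exons):
    """Length-weighted mean of exon probabilities.

    exons: iterable of (length_bp, probability) pairs, one pair per exon.
    """
    exons = list(exons)
    total_length = sum(length for length, _ in exons)
    if total_length <= 0:
        raise ValueError("total exon length must be positive")
    return sum(length * p for length, p in exons) / total_length


# Two hypothetical exons: 100 bp with P = 0.9 and 300 bp with P = 0.5.
# (100 * 0.9 + 300 * 0.5) / (100 + 300) = 240 / 400 = 0.6
mean_p = weighted_mean_p([(100, 0.9), (300, 0.5)])
print(mean_p)
```

In this example the weighted mean of 0.6 exceeds the 0.4 threshold listed above, whereas 
the longer exon alone would dominate a gene whose short exons carry high probabilities. 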

The GeneMark.hmm for Eukaryotes program uses the Hidden Markov model for 
species Oryza sativa. Minimum total length of all exons in a gene is 500 bp. Except for 
the model selection, there is no specific run-time parameter for GeneMark.hmm. 

The gene predictions from these programs are stored in a database and then 
combigenes are derived from these predictions. A combigene is a cluster of putative 
genes which satisfy the following criteria: 

All genes making up a single combigene are located on the same strand of a 
contig; 

Maximum intron size of a valid gene is 4000 bp; 

Maximum distance between any two genes in the same combigene is 200 bp, as 
measured by the bases between adjacent ending exons; 

If an individual gene is predicted by NAP it has at least 40% sequence identity to 
its hit; 

If an individual gene is predicted by GAP2 it has at least 92% sequence identity to 
its hit; 

If an individual gene is predicted by GenScan the weighted average of the 
probabilities calculated for all of its exons is not less than 0.4. The gene boundaries of a 
GenScan-predicted gene are determined while taking into account only exons. 

Since TBLASTX-predicted genes are strandless, the combigene which is made up 
of such genes can be assigned a strand only if there is a gene in the cluster that was 
predicted by a strand-defining gene-predicting program. 
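
The strand and distance criteria can be sketched in Python as follows. This is a 
simplified illustration under several assumptions: gene predictions have already passed 
the per-program identity and probability thresholds, the intron-size criterion is not 
checked, genes are merged in a single greedy pass ordered by start coordinate, and the 
class and function names are hypothetical. 

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gene:
    start: int              # first coding base on the contig, 1-based
    end: int                # last coding base on the contig, inclusive
    strand: Optional[str]   # "+", "-", or None for strandless TBLASTX hits
    source: str             # e.g. "GAP2", "NAP", "GenScan", "TBLASTX"


def build_combigenes(genes, max_gap=200):
    """Greedily cluster genes that are close together and strand-compatible."""
    clusters = []
    for gene in sorted(genes, key=lambda g: (g.start, g.end)):
        if clusters:
            last = clusters[-1]
            gap = gene.start - max(g.end for g in last) - 1
            strands = {g.strand for g in last if g.strand}
            if gene.strand:
                strands.add(gene.strand)
            # Join only if within max_gap bases and no strand conflict arises.
            if gap <= max_gap and len(strands) <= 1:
                last.append(gene)
                continue
        clusters.append([gene])
    return clusters


def combigene_strand(cluster):
    """Strand of a combigene, or None if every member is strandless."""
    strands = {g.strand for g in cluster if g.strand is not None}
    return strands.pop() if strands else None


genes = [
    Gene(100, 500, "+", "GAP2"),
    Gene(600, 900, None, "TBLASTX"),    # joins the first cluster
    Gene(1500, 2000, "+", "GenScan"),   # more than 200 bases away
    Gene(2100, 2500, "-", "NAP"),       # close, but on the opposite strand
    Gene(5000, 5300, None, "TBLASTX"),  # isolated strandless prediction
]
combigenes = build_combigenes(genes)
print([combigene_strand(c) for c in combigenes])
```

In this example the strandless TBLASTX prediction at 600-900 inherits the "+" strand of 
the GAP2 prediction it is clustered with, while the isolated TBLASTX prediction yields 
a combigene with no assigned strand. 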

Example 4: Identifying promoters in the genomic BAC library using bioinformatic 
techniques 

Candidate promoter sequences are selected by identifying the regions of DNA 
located immediately upstream of "combigenes" as described and defined in Example 3. 
The length of the region to be extracted from the corresponding contig's sequence is set 
to be 1500 nucleotides plus the very first nucleotide of a combigene. Thus, if a 
combigene is sufficiently far from the edge of a contig, a 1501 nucleotide sequence is 
obtained, otherwise the sequence will be shorter. Only coding region predictions are 
considered when building combigenes. Therefore, the 5' UTR of the putative cDNA is 
included as part of the combigene upstream region. 

If there is an AAT/NAP-predicted component in a combigene, then the putative 
promoter sequence is extracted upstream of the beginning of that component; otherwise, 
the sequence is extracted upstream of the beginning of the combigene (which may 
correspond to a GenScan, AAT/GAP2 or TBLASTX prediction). 
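
The extraction of the upstream region can be illustrated by the Python sketch below. The 
coordinate convention (1-based, inclusive positions on the forward strand of the contig), 
the treatment of minus-strand combigenes by reverse complementation, and the function 
name are assumptions made for this illustration. 

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def upstream_region(contig, gene_start, gene_end, strand, upstream=1500):
    """Return up to `upstream` bases 5' of a gene plus its first base.

    gene_start and gene_end are 1-based inclusive forward-strand coordinates
    with gene_start <= gene_end. The result is oriented 5' to 3' relative to
    the gene, so the gene's first base is always the last base returned.
    """
    if strand == "+":
        lo = max(0, gene_start - 1 - upstream)
        return contig[lo:gene_start]
    if strand == "-":
        hi = min(len(contig), gene_end + upstream)
        return contig[gene_end - 1:hi].translate(_COMPLEMENT)[::-1]
    raise ValueError("strand must be '+' or '-'")


# Hypothetical plus-strand gene starting at position 2001 of a toy contig.
contig = "T" * 2000 + "ATGCCC"
region = upstream_region(contig, 2001, 2006, "+")
print(len(region), region[-1])
```

When the combigene lies closer than 1500 nucleotides to the edge of the contig, the 
slice is truncated at the contig boundary and a shorter sequence is returned. 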

Promoter candidates are further selected using bioinformatic analysis of the 
candidate promoter sequence. 

The candidate promoter regions listed in SEQ ID NO:1 through SEQ ID 
NO:57467 are analyzed for known promoter motifs listed in Table 2. 

The identification of such motifs provides important information about the 
candidate promoter. For example, some motifs are associated with informative 
annotations such as "light inducible binding site" or "stress inducible binding motif" and 
can be used to select with confidence a promoter that is able to confer light inducibility or 
stress inducibility to an operably-linked transgene, respectively. 

Putative promoter sequences are also searched with matrices for the TATA box, 
GC box (factor name: V_GC_01) and CCAAT box (factor name: F_HAP234_01). The 
matrix for the TATA box is from the Eukaryotic Promoter Database 
(http://www.epd.isb-sib.ch/) and the matrices for the GC box and the CCAAT box are 
from Transfac (http://transfac.gbf.de/TRANSFAC/). 

The algorithm that is used to annotate promoters searches for matches to both 
sequence motifs and matrix motifs. First, individual matches are found. For sequence 
motifs, a maximum number of mismatches is allowed (see Table 2). If the code 
M, R, W, S, Y, or K is listed in the sequence motif (each of which is a degenerate code for 
2 nucleotides), 1/2 mismatch is allowed. If the code B, D, H, or V is listed in the 
sequence motif (each of which is a degenerate code for 3 nucleotides), 1/3 mismatch is 
allowed. P values are determined by simulation: 5 Mb of random DNA with the 
same dinucleotide frequency as the test set is generated and the probability of a given 
matrix score is determined (number of hits/5e7). Once the individual hits have been 
found, the putative promoter sequence is searched for clusters of hits in a 250 bp window. 
The score for a cluster is found by summing the negative natural log of the p value for 
each individual hit. Using 100 Mb simulations as described above, the probability of a 
window having a cluster score greater than or equal to the given value is determined. 
Clusters with a p value more significant than p < 1e-6 are reported. Only the top 287 hits 
are taken and are ranked by p value. Effects of repetitive elements are screened. If the 
287th ranked hit has the same p value as the first ranked hit, no results are reported for 
that factor. 

For matrix motifs, a p value cutoff is applied to a matrix score. The matrix score is
determined by summing the matrix values along the path of a given DNA sequence
through the matrix. P values are determined by simulation; 5 Mb of random DNA with the
same dinucleotide frequency as the test set is generated to test individual matrix hits and
100 Mb is used to test clusters; the probability of a given matrix score and the probability
scores for clusters are determined as for the sequence motifs. The usual cutoff for
matrices is 2.5e-4. No clustering is done for the TATA box, GC box or CCAAT box.
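As an illustration of the matrix scoring described above, the following Python sketch sums the matrix values along the path of a sequence and estimates a p value from a set of background scores. The toy matrix, the function names and the background scores are hypothetical; the real TATA, GC and CCAAT matrices come from the databases cited above, and the background would come from the simulated random DNA.

```python
def matrix_score(matrix, site):
    """Sum the matrix entries along the path of `site` through `matrix`.

    matrix -- list with one dict per motif position, mapping base -> weight
    site   -- DNA string of the same length as the matrix
    """
    if len(site) != len(matrix):
        raise ValueError("site and matrix must have the same length")
    return sum(column[base] for column, base in zip(matrix, site))

def scan(matrix, sequence):
    """Score every window of the sequence; returns (position, score) pairs."""
    width = len(matrix)
    return [(i, matrix_score(matrix, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

def empirical_p(score, background_scores):
    """Fraction of background scores that are at least as high as `score`."""
    hits = sum(1 for b in background_scores if b >= score)
    return hits / len(background_scores)

# Toy 4-position matrix favouring "TATA" (invented weights, for illustration).
TOY_MATRIX = [
    {"A": 0, "C": 0, "G": 0, "T": 2},
    {"A": 2, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 2},
    {"A": 2, "C": 0, "G": 0, "T": 0},
]

best = max(scan(TOY_MATRIX, "GGTATAGG"), key=lambda hit: hit[1])
print(best)
```

A hit would be kept when its empirical p value falls below the chosen cutoff (2.5e-4 in the text).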

Candidate promoters are also selected based on the expression characteristics of
the gene that is cis-associated with the candidate promoter (i.e., the native gene). For
example, a promoter region located 5' to a gene that is expressed during a specific
stage of development likely plays a key role in the temporal regulation of that gene.
Thus the promoter, when operably linked to a heterologous coding sequence, may
similarly regulate the heterologous coding sequence.

Combining the motif analysis with the expression analysis, the list of candidate 
promoters having desired properties can be narrowed. This decreases the overall number 
of candidate promoters that must be screened to confirm the promoter's function. For 
example, one can start with seed-expressed transcription factors, identify candidate 
promoters that match the consensus regulation sites for seed-expressed transcription 
factors, and then test the identified candidate promoters to confirm the subset of
promoters that is capable of conferring seed-specific expression to a gene.
Example 5: Identifying promoters in the genomic BAC library using an expression assay 
Promoters may also be identified based on quantitative analysis of genes that are
cis-associated with candidate promoters (i.e., the native genes). In this method, the native
genes associated with SEQ ID NO:1 through SEQ ID NO:57,467 are analyzed on a
digital northern blot. Digital northern data can be generated from EST sequencing,
SAGE and other methods, which in effect count RNA molecules expressed in a cell. This
data can be generated as needed, or is generally available to the public on a number of
web sites (e.g., www.tigr.org). Data can be obtained from any plant species, although
data on rice gene expression is particularly preferred. Promoters are selected based on the
expression information of the digital northern. For example, identifying genes
expressed under stress-related conditions would provide a group of promoters able
to confer such stress-inducible expression to other genes.
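A digital northern comparison of this kind might be sketched in Python as shown below. The sketch is an assumption-laden illustration: the gene names, counts, library sizes, the pseudocount of 1 and the five-fold threshold are all invented, and the text does not prescribe any particular statistic.

```python
def stress_enriched(stress_counts, control_counts, stress_total, control_total,
                    min_fold=5.0):
    """Return genes whose EST/tag frequency is enriched in the stress library.

    stress_counts, control_counts -- dicts mapping gene -> number of tags
    stress_total, control_total   -- total tags sequenced in each library
    min_fold                      -- hypothetical enrichment threshold
    A pseudocount of 1 is added to each count (an assumption) so that genes
    absent from the control library do not cause a division by zero.
    """
    selected = []
    for gene, n_stress in stress_counts.items():
        n_control = control_counts.get(gene, 0)
        freq_stress = (n_stress + 1) / stress_total
        freq_control = (n_control + 1) / control_total
        if freq_stress / freq_control >= min_fold:
            selected.append(gene)
    return sorted(selected)

# Invented tag counts for three hypothetical genes.
stress = {"geneA": 40, "geneB": 5, "geneC": 9}
control = {"geneA": 3, "geneB": 6}
print(stress_enriched(stress, control, 10000, 10000))
```

The promoters cis-associated with the selected genes would then be the candidates for stress-inducible expression.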

Example 6: Identifying promoters in the genomic BAC library using microarray analysis

Promoters may also be selected by transcriptional profiling or microarray
analysis. Transcriptional profiling can be completed on a large scale for each cis-linked
gene associated with SEQ ID NO:1 through SEQ ID NO:57,467. Transcription profiling
data can be obtained on RNA prepared from any plant species using a chip comprised of
sequences from any plant species, although data generated from rice using a rice chip is
preferred.

A comprehensive database of transcription profiling data narrows down the list of 
promoter candidates that confer a desired expression pattern. For example, a promoter 
that confers drought-specific expression can be selected by identifying a cis-linked gene 
that is induced under drought conditions (on the microarray), but is not expressed at other 
stages of plant growth and development. Such a promoter is likely to confer drought 
inducibility to an operably linked transgene. Public databases of transcript profiling data
are becoming more comprehensive and thereby enabling this type of analysis.
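The drought example above amounts to a simple filter over a table of expression values, sketched here in Python. The data layout, the condition labels and both thresholds are hypothetical and chosen only to make the example concrete.

```python
def drought_specific(profiles, induced_min=2.0, background_max=0.5):
    """Select genes induced under drought and not expressed elsewhere.

    profiles       -- dict mapping gene -> {condition: expression value}
    induced_min    -- minimum drought signal to call a gene induced (assumed)
    background_max -- maximum signal tolerated in all other samples (assumed)
    """
    selected = []
    for gene, by_condition in profiles.items():
        drought = by_condition["drought"]
        others = [value for condition, value in by_condition.items()
                  if condition != "drought"]
        if drought >= induced_min and all(v <= background_max for v in others):
            selected.append(gene)
    return selected

# Invented microarray values for three hypothetical cis-linked genes.
profiles = {
    "g1": {"drought": 8.0, "leaf": 0.1, "seed": 0.2},
    "g2": {"drought": 9.0, "leaf": 5.0, "seed": 0.1},
    "g3": {"drought": 0.3, "leaf": 0.1, "seed": 0.1},
}
print(drought_specific(profiles))
```

Only the first gene passes both conditions, so only its promoter would be carried forward as a drought-inducible candidate.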

Example 7: Functional screening of promoters in an expression assay

Promoters are screened in an expression assay. The promoters in SEQ ID NO:1
through SEQ ID NO:57,467 are amplified by PCR from rice genomic DNA and cloned
into an expression vector containing a reporter transgene (e.g., GUS or GFP). The
individual promoter or a collection of promoters ("promoter library") is then screened in
an expression assay for the ability to express the reporter transgene. In a common
expression assay for leaf promoters, the promoters are transfected into rice or maize leaf
protoplasts. Reporter gene expression in the protoplasts indicates a promoter capable of
conferring gene expression in the leaf. The promoters are also transfected into
protoplasts from other tissues or plant species to identify other regulatory features of the
promoter.

Alternatively, promoters may be screened using a particle gun technique to 
bombard the cells, tissues or plants. The bombarded samples are visually inspected for 
reporter gene expression. Reporter gene expression observed in any bombarded samples 
indicates the presence of a promoter able to confer expression of a transgene in that cell, 
tissue or plant. 

The promoters may also be screened in plants where transformation protocols
have been greatly enhanced to facilitate the screening of large numbers of promoters. In
this approach, the individual rice promoters or the "promoter library" are transformed into
Arabidopsis plants. The resulting transformed tissues or progeny are scored for reporter
expression. Again, reporter gene expression in a given tissue indicates that a promoter is
able to confer transgene expression in that tissue.

For some promoters, such as those providing constitutive expression, a reporter
transgene can be replaced with a selectable marker transgene, such as a gene conferring
glyphosate tolerance. Transformed cells, tissues or plants expressing the selectable
marker are selected, rather than visually scored. For example, the promoter is linked to a
selectable marker, such as glyphosate resistance, and the plants are then screened for male
sterility. The selected plants, in this case male sterile plants, may contain a promoter for
male reproductive tissues.

The promoters described herein can also be used to ablate or kill cells expressing
a gene from the promoter. In such cases, the promoter is operably linked to a negative
selectable marker gene, including but not limited to the diphtheria toxin gene, or to a
conditional lethal gene, including but not limited to the phosphonate ester hydrolase gene
(pehA). The negative selectable marker gene is transformed into cells, tissues or plants.
The cells, tissues or plants which express the negative selectable gene from the promoter
are selectively killed. In the case of the conditional lethal gene, the transformed cells,
tissues or plants which express the conditional lethal gene are only killed in the presence
of the negative selective agent or negative selective condition. In the example of the
phosphonate ester hydrolase gene, the transformed cells, tissues or plants which express
the conditional lethal gene are only killed in the presence of glyceryl glyphosate.
Table 1

The data in Table 1 provides features relating to the putative promoter sequences. 
*Column headings for Table 1

Seq Num: Provides the SEQ ID NO. for the rice contigs on which the putative
promoter sequences are found.

Contig ID: unique identifier of the rice contig

CmbGID: name of the putative promoter sequence. Putative promoters are
named as cg_"no". The "no" refers to the combigene from which the putative promoter is
selected.

CNTG LEN: The length of the contig

BEGN POS: starting position of the putative promoter sequence

STRND: DNA strand on which the combigene is located

LEN: length of the putative promoter sequence



Table 2

Table 2 lists the sequence motifs that are searched in the putative promoter
sequences.

Table 2

Transcription Factor Name | Sequence Motif name | Sequence Motif | Sequence Motif Length | Maximum mismatches allowed | Reference for transcription factors and sequence motifs

Fac006 | ABADESI1 | RTACGTGGCR | 10 | 1 | PLACE
Fac007 | ABADESI2 | GGACGCGTGGC | 11 | 2 | PLACE
Fac010 | ABFOS | GCATCTTTACTTTAGCATC | 19 | 6 | PLACE
Fac016 | ABRE3OSRAB16 | GTACGTGGCGC | 11 | 2 | PLACE
Fac016 | ABREATRD22 | RYACGTGGYR | 10 | 0 | PLACE
Fac020 | ABREOSRAB21 | ACGTSSSC | 8 | 0 | PLACE
Fac021 | ABREOSRGA1 | CCACGTGG | 8 | 0 | PLACE
Fac021 | ABRETAEM | GGACACGTGGC | 11 | 2 | PLACE
Fac022 | ACGTABOX | TACGTA | 6 | 0 | PLACE
Fac031 | AMYBOX1 | TAACARA | 7 | 0 | PLACE
Fac032 | AMYBOX2 | TATCCAT | 7 | 0 | PLACE
Fac060 | DREDR1ATRD29AB | TACCGACAT | 9 | 1 | PLACE
Fac064 | EREGCCNTCH | TAAGAGCCGCC | 11 | 2 | PLACE
Fac066 | GARE2 | RTAACARANTCYGG | 14 | 2 | PLACE
Fac068 | GBOXRELOSAMY3 | CTACGTGGCCA | 11 | 2 | PLACE
Fac070 | GLUTAACAOS | AACAAACTCTAT | 12 | 2 | PLACE
Fac071 | 1OSGT2GLUTEBOX | ATATCATGAGTCACTTCA | 18 | 4 | PLACE
Fac071 | 1OSGT2GLUTEBOX | ATATCATGAGTCACTTCA | 18 | 4 | PLACE
Fac071 | 1OSGT3GLUTEBOX | TATCTAGTGAGTCACTTCA | 19 | 5 | PLACE
Fac071 | 1OSGT3GLUTEBOX | TATCTAGTGAGTCACTTCA | 19 | 5 | PLACE
Fac072 | 2OSGT2GLUTEBOX | TCCGTGTACCA | 11 | 2 | PLACE
Fac072 | 2OSGT2GLUTEBOX | TCCGTGTACCA | 11 | 2 | PLACE
Fac072 | 2OSGT3GLUTEBOX | CTTTTGTGTACCTTA | 15 | 3 | PLACE
Fac072 | 2OSGT3GLUTEBOX | CTTTTGTGTACCTTA | 15 | 3 | PLACE
Fac073 | GLUTEBP1OS | AAGCAACACACAAC | 14 | 3 | PLACE
Fac074 | GLUTEBP2OS | ATGCTCAATAGATATAAGT | 19 | 5 | PLACE
Fac075 | GLUTECOREOS | CTTTCGTGTAC | 11 | 2 | PLACE
Fac079 | GT2OSPHYA | GCGGTAATT | 9 | 1 | PLACE
Fac105 | MYBGAHV | TAACAAA | 7 | 0 | PLACE
Fac129 | PROLAMINBOX | CACATGTGTAAAGGT | 15 | 4 | PLACE
Fac135 | RGATAOS | CAGAAGATA | 9 | 1 | PLACE
Fac136 | RNFG1OS | GATCATCGATC | 11 | 2 | PLACE
Fac137 | RNFG2OS | CCAGTGTGCCCCTGG | 15 | 4 | PLACE
Fac139 | RYREPEAT4 | TCCATGCATGCAC | 13 | 3 | PLACE
Fac139 | RYREPEAT4 | TCCATGCATGCAC | 13 | 3 | PLACE
Fac139 | RYREPEATGMGY2 | CATGCAT | 7 | 0 | PLACE
Fac139 | RYREPEATGMGY2 | CATGCAT | 7 | 0 | PLACE
Fac139 | RYREPEATLEGUMINBOX | CATGCAY | 7 | 0 | PLACE
Fac139 | RYREPEATLEGUMINBOX | CATGCAY | 7 | 0 | PLACE
Fac139 | RYREPEATVFLEB4 | CATGCATG | 8 | 0 | PLACE
Fac139 | RYREPEATVFLEB4 | CATGCATG | 8 | 0 | PLACE
Fac149 | SITEIIAOSPCNA | TGGGCCCGT | 9 | 1 | PLACE
Fac150 | SITEIIBOSPCNA | TGGTCCCAC | 9 | 1 | PLACE
Fac151 | SITEIOSPCNA | CCAGGTGG | 8 | 1 | PLACE
Fac163 | AACAOSGLUB1 | CAACAAACTATATC | 14 | 3.5 | PLACE
Fac165 | ACGTOSGLUB1 | GTACGTG | 7 | 0 | PLACE
Fac180 | GT1CONSENSUS | GRWAAW | 6 | 0 | PLACE
Fac201 | PYRIMIDINEBOXOSRAMY1A | CCTTTT | 6 | 0 | PLACE
Fac218 | ABREMOTIFAOSOSEM | TACGTGTC | 8 | 0.5 | PLACE
Fac219 | ABREMOTIFIIIOSRAB16B | GCCGCGTGGC | 10 | 1.5 | PLACE
Fac220 | ABREMOTIFIOSRAB16B | AGTACGTGGC | 10 | 1.5 | PLACE
Fac223 | CE3OSOSEM | AACGCGTGTC | 10 | 1.5 | PLACE
Fac267 | POLASIG2 | AATTAAA | 7 | 0 | PLACE
OS_A-box | OS_A-box | TATCCATCCATCC | 13 | 3 | PlantCARE
OS_A-box2 | OS_A-box2 | AATAACAAACTCC | 13 | 3 | PlantCARE
OS_AACA | OS_AACA | TAACAAACTCCA | 12 | 2.5 | PlantCARE
OS_ABRE | OS_ABRE | GACACGTACGT | 11 | 2 | PlantCARE
OS_ABRE2 | OS_ABRE2 | ACGTACGTGTCGCGC | 15 | 4 | PlantCARE
OS_AP-2-like | OS_AP-2-like | CGCGCCGG | 8 | 0.5 | PlantCARE
OS_AP-2-like2 | OS_AP-2-like2 | CGACCAGG | 8 | 0.5 | PlantCARE
OS_ATGCAAAT | OS_ATGCAAAT | ATACAAAT | 8 | 0.5 | PlantCARE
OS_CE3 | OS_CE3 | GACGCGTGTC | 10 | 1.5 | PlantCARE
OS_GATT | OS_GATT | CTCCTGATTGGA | 12 | 2.5 | PlantCARE
OS_GCN4 | OS_GCN4 | TGWGTCA | 7 | 0 | PlantCARE
OS_GCN4_2 | OS_GCN4_2 | CAAGCCA | 7 | 0 | PlantCARE
OS_P-box | OS_P-box | GCCTTTTGAGT | 11 | 2 | PlantCARE
OS_P-box2 | OS_P-box2 | CCTTTTG | 7 | 0 | PlantCARE
OS_Prolamin_box | OS_Prolamin_box | TGCAAAGT | 8 | 0.5 | PlantCARE
OS_Skn-1 | OS_Skn-1 | GTCAT | 5 | 0 | PlantCARE
OS_TATC-box | OS_TATC-box | TATCCCA | 7 | 0 | PlantCARE
OS_TGGCA | OS_TGGCA | GACACCAAGTGGCA | 14 | 3.5 | PlantCARE
OS_light | OS_light | AACCAATCTCATCCATCC | 18 | 5.5 | PlantCARE
AS_RF2A_01 | AS_RF2A_01 | CCAGTGTGGCGCTGG | 15 | 4 | TRANS
ATRS1A_01 | ATRS1A_01 | CTTCCACGTGGCA | 13 | 3 | TRANS
PV_GRP18_01 | PV_GRP18_01 | TGGATGTGGAAGACAGCA | 18 | 5.5 | TRANS
RICE_ACT_01 | RICE_ACT_01 | GCCCAACCCAACCCAAC | 17 | 5 | TRANS
RICE_AGB_01 | RICE_AGB_01 | GCCACGTAAG | 10 | 1.5 | TRANS
RICE_AGB_03 | RICE_AGB_03 | GCCACGTCAG | 10 | 1.5 | TRANS
RICE_EM_01 | RICE_EM_01 | TACGTGT | 7 | 0 | TRANS
RICE_EM_02 | RICE_EM_02 | GACGTGT | 7 | 0 | TRANS
RICE_GL51_01 | RICE_GL51_01 | AAGTCATAACTG | 12 | 2.5 | TRANS
RICE_GL51_02 | RICE_GL51_02 | CCATGTCATATT | 12 | 2.5 | TRANS
RICE_GL51_03 | RICE_GL51_03 | AATGATGTGTCAAT | 14 | 3.5 | TRANS
RICE_GL51_04 | RICE_GL51_04 | TTCCGTGTACCAC | 13 | 3 | TRANS
RICE_GL51_05 | RICE_GL51_05 | TGAGTCA | 7 | 0 | TRANS
RICE_GLU2_01 | RICE_GLU2_01 | CCTTTCGTGTACC | 13 | 3 | TRANS
RICE_GLUB1_01 | RICE_GLUB1_01 | CTGAGTCAT | 9 | 1 | TRANS
RICE_NITR_01 | RICE_NITR_01 | CACGTCAC | 8 | 0.5 | TRANS
RICE_RAB16A_01 | RICE_RAB16A_01 | TACGTGGCNNNNCCGCCGCGCCT | 23 | 6 | TRANS
RICE_RAB16A_03 | RICE_RAB16A_03 | GTACGTGG | 8 | 0.5 | TRANS
TAF-1 | AS_TAF1_01 | GCAACGTGGC | 10 | 1.5 | TRANS
TAF-1 | RICE_RAB16B_01 | GGTACGTGGCG | 11 | 2 | TRANS
WHEAT_H3_01 | WHEAT_H3_01 | CCACGTCA | 8 | 0.5 | TRANS
Seed_sp | Seed_AACA_motif | AACAAACTCTATC | 13 | 3 | lit1
Seed_sp | Seed_GCN4 | GTGAGTCAC | 9 | 1 | lit1
SugRep | ACGTABOX | TACGTA | 6 | 0 | lit2
SugRep | TCmotif | TATCCAY | 7 | 0 | lit2
Amy3D | AMYBOX2 | TATCCAT | 7 | 0 | lit3
Amy3D | GBOXRELOSAMY3 | CTACGTGGCCA | 11 | 2 | lit3

*Column Headings for Table 2

Transcription Factor Name: Name of the transcription factor which binds to a
sequence motif within a promoter region

Sequence Motif name: Name of the sequence motif to which the transcription
factor binds

Sequence Motif: sequences searched to further annotate putative promoters

Maximum mismatches allowed: Number of mismatches that are permitted
when searching for motifs to annotate putative promoters

Reference for transcription factors and sequence motifs: Motifs and
transcription factors are found in one of three databases, PLACE, PlantCARE or TRANS
(respectively, http://www.dna.affrc.go.jp/htdocs/PLACE/,
http://sphinx.rug.ac.be:8080/PlantCARE/index.htm and
http://transfac.gbf.de/TRANSFAC/), or in the literature: lit1 (Yoshihara et al., FEBS
Letters 383, 1996, pp 213-218), lit2 (Toyofuku K et al. FEBS Lett 428:275-280 (1998))
or lit3 (Huang et al Plant Mol Biol 14:655-668 (1990)).
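To make the use of the Sequence Motif and Maximum mismatches allowed columns concrete, the following Python sketch searches a sequence for a motif written with IUPAC degenerate codes. It is a simplified illustration under stated assumptions: the function names and the test sequence are hypothetical, only the forward strand is scanned, and every mismatch counts as one whole mismatch, so the fractional allowances described earlier are not modelled.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def count_mismatches(motif, site):
    """Number of positions where the base is not permitted by the motif code."""
    return sum(base not in IUPAC[code] for code, base in zip(motif, site))

def find_motif(motif, sequence, max_mismatches=0):
    """Return (position, matched_site, mismatches) for each acceptable window."""
    width = len(motif)
    hits = []
    for i in range(len(sequence) - width + 1):
        site = sequence[i:i + width]
        mismatches = count_mismatches(motif, site)
        if mismatches <= max_mismatches:
            hits.append((i, site, mismatches))
    return hits

# ABADESI1 from Table 2 (RTACGTGGCR, 1 mismatch allowed) against an
# invented test sequence.
print(find_motif("RTACGTGGCR", "TTGTACGTGGCAAA", max_mismatches=1))
```

Scanning the reverse complement as well would be needed to report hits on both strands, as the Strand column of Table 4 implies.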
Table 3 

Table 3 describes those putative promoter sequences containing TATA boxes, GC
boxes or CCAAT boxes as determined by matrix motifs.

*Column headings for Table 3
Seq num

Provides the SEQ ID NO. for the listed sequences.
Seq ID

Arbitrarily assigned identifier for each putative promoter sequence
Start

Indicates the start position of the TATA box, GC box or CCAAT box.
End

Indicates the end position of the TATA box, GC box or CCAAT box.
p-Value

Probability value is determined by simulation as described above.
-ln(p-Value)

Indicates the negative natural log of the p-Value.
# of hits in cluster

No clustering is done in Table 3. Therefore, all entries in this column are listed as
"1".
Factor Name

Transcription factors associated with the TATA box (TATA-plant), GC box
(V_GC_01) and CCAAT box (F_HAP234_01) are listed under this column heading when
the matrix motifs for the TATA box, GC box and CCAAT box are identified.
Site name

Lists whether the search is done for the TATA box, GC box or CCAAT box.
Table 4

Table 4 describes those putative promoter sequences containing specific sequence motifs
as listed in Table 2.

Column headings for Table 4
Seq num

Provides the SEQ ID NO. for the listed sequences.
Seq ID

Arbitrarily assigned identifier for each putative promoter sequence
Start

Indicates the start position of a sequence motif. Note that if multiple motifs are
present in a window of sequence, the start position, the first time it is indicated, lists the
start position of the first motif in a window. The second or subsequent times a start
position is listed, this heading refers to the start position of the subsequent individual
sequence motif within the window of sequence.
End

Indicates the end position of a sequence motif. Note that if multiple motifs are
present in a window of sequence, the end position, the first time it is indicated, lists the
end position of the first motif in a window. The second or subsequent times an end
position is listed, this heading refers to the end position of the subsequent individual
sequence motif within the window.
Strand

The strand of genomic DNA on which the sequence motif is located (+/-)
p-Value

Probability value as determined by simulation as described above. Note that if
multiple motifs are present in a window of sequence, the p-Value, the first time it is
indicated, lists the p-Value for all of the motifs in a window of sequence. The
second or subsequent times a p-Value is listed, this heading refers to the p-Value of the
subsequent individual sequence motif within the window.
-ln(p-Value)

Indicates the negative natural log of the p-Value as described in the column
heading above.
# of hits in cluster

Clustering is described above. The number of hits in a cluster refers to the number
of times sequence motifs appear in a window of sequence.
Factor Name

Transcription factors associated with the sequence motifs listed in Table 2 are
included under this column heading.
Site name

Lists the sequence motif for which a search is done on a putative promoter
sequence.
Table 5 

Table 5 lists the putative promoter sequences for which sequence or matrix motifs 
are not identified. 

Column headings for Table 5 
Seq num

Provides the SEQ ID NO. for the listed sequences. 
Seq ID 

Arbitrarily assigned identifier for each putative promoter sequence 
Table 6 

Table 6 lists the contigs, combigenes and description information associated with 
each gene prediction. 

*Column Headings: 
Seq num 

Provides the SEQ ID NO. for the listed sequences. 
Contig id 

Arbitrarily assigned name for each contig. 
CDS

The location of the exons found within the gene as determined by the
gene-predicting program (Method).
CG ID

Arbitrarily assigned name for each combigene.
CG Start

Indicates the start position of the combigene.
CG End

Indicates the end position of the combigene.
Strand

Indicates the strand location of the gene (+/-).
Gene

Indicates an arbitrarily assigned gene name based on the method used to predict
the gene.

Method 

Indicates the gene-predicting program used. These programs are GenScan, 
AAT/NAP, AAT/GAP, TBLASTX or Genemark.hmm. 
Gene Start 

The start position of the putative gene making up a combigene as predicted by the 
particular gene-predicting program used. 
Gene End 

The end position of the putative gene making up a combigene as predicted by the 
particular gene-predicting program used. 
Hit Score 

The aat_nap score (under Hit score in the rows where the method is AAT/NAP) is
reported by the nap program in the aat package. It is an alignment score in which each
match and mismatch is scored based on the BLOSUM62 scoring matrix. The aat_gap
score (under Hit score in the rows where the method is AAT/GAP) is the alignment score
for each hit sequence, as reported by AAT/GAP.

For TBLASTX, the bit score listed is the BLAST match score that is generated by the
sequence comparison of the genomic contig with the Monsanto cDNA sequence named
under the GI column. The E-value corresponding to a given bit score is E =
mn2^{-S'}, where "m" and "n" are the lengths of the two proteins, "E" is the E value and
S' is the bit score.
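The relation between bit score and E value stated above, E = mn2^{-S'}, can be evaluated directly. The short Python sketch below does so; the function name and the example lengths and bit score are hypothetical and chosen only for illustration.

```python
def evalue_from_bit_score(m, n, bit_score):
    """E value for a bit score S', using E = m * n * 2**(-S').

    m, n      -- lengths of the two compared sequences
    bit_score -- the bit score S'
    """
    return m * n * 2.0 ** (-bit_score)

# Invented example: sequences of length 100 and 200 with a bit score of 10.
print(evalue_from_bit_score(100, 200, 10))
```

Because the E value halves with every additional bit, higher bit scores in the Hit Score column correspond to sharply more significant matches.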
GI

Each sequence in the GenBank public database is arbitrarily assigned a unique
NCBI gi (National Center for Biotechnology Information GenBank Identifier) number.
In this table, the NCBI gi number which is associated (in the same row) with a given
contig or singleton refers to the particular GenBank sequence which is the best match for
that sequence. If the hit is based on cDNAs from Monsanto's SeqDB, the name of the
matching cDNA sequence is given.
Description

The Description column provides a description of the NCBI gi referenced in the
"GI" column.